SNPs, Haplotypes, Disease Associations
Algorithmic Foundations of Computational Biology II
Course 1
Prof. Sorin Istrail
SNPs and the Human Genome:
The Minimal Informative Subset
Overview
Introduction: SNPs, Haplotypes
A Data Compression Problem: The Minimum Informative Subset
A New Measure: Informativeness
A Most Challenging Problem
“None of the [advances of the 20th century medicine] depend
on a deep knowledge of cellular processes or on any
discoveries of molecular biology.
Cancer is still treated by gross physical and chemical assaults
on the offending tissue.
Cardiovascular Disease is treated by surgery whose
anatomical bases go back to the 19th century …
Of course, intimate knowledge of the living cell
and of basic molecular processes may be useful
eventually.”
Lewontin (1991)
Now
“A decade later, molecular biology can
claim very few successes for drugs in
clinical use that were designed ab initio
to control a specific component of a pathway
linked to disease: these include the
monoclonal antibody Herceptin, and the
kinase inhibitor Gleevec.”
Reik, Gregory and Urnov (2002)
Introduction
SNPs, HAPLOTYPES
Single Nucleotide Polymorphism (SNP)
GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG
A SNP is a position in a genome at which two or more
different bases occur in the population, each with a
frequency >1%.
The two alleles at the site are G and T

The most abundant type of polymorphism
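A minimal sketch of how the variable site in the two sequences above can be located. The function name variable_sites is hypothetical, and the sketch only finds positions where aligned sequences disagree; it does not check the >1% population frequency criterion, which needs a population sample.

```python
def variable_sites(sequences):
    """Return 0-based positions at which the aligned sequences do not all agree."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i in range(length) if len({s[i] for s in sequences}) > 1]


# The two sequences shown on the slide.
seqs = ["GATTTAGATCGCGATAGAG", "GATTTAGATCTCGATAGAG"]
sites = variable_sites(seqs)
alleles = [sorted({s[i] for s in seqs}) for i in sites]
print(sites, alleles)  # [10] [['G', 'T']]
```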
[Figure: a stretch of genomic sequence with the alternative alleles marked at SNP sites.]
Human Genome contains ~3 G basepairs arranged in 46 chromosomes.
Two individuals are 99.9% the same, i.e., differ in ~3 M basepairs.
SNPs occur once every ~600 bp.
Average gene in the human genome spans ~27 Kb.
~50 SNPs per gene.
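A quick arithmetic check of the figures above, as a sketch. The variable names are illustrative; the input values are the approximate ones quoted on the slide.

```python
genome_bp = 3_000_000_000   # ~3 G basepairs
identity = 0.999            # two individuals are 99.9% the same
snp_spacing_bp = 600        # one SNP every ~600 bp
gene_span_bp = 27_000       # average gene spans ~27 Kb

differing_bp = genome_bp * (1 - identity)       # about 3 million basepairs
snps_per_gene = gene_span_bp / snp_spacing_bp   # 45, which the slide rounds to ~50
print(round(differing_bp), snps_per_gene)
```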
Haplotype
Two individuals:
G C T C G A C A A C A G
G T T C G T C A A C A G
Haplotypes over the three marked SNPs: C A G and T T G
Mutations
Infinite Sites Assumption:
Each site mutates at most once
Haplotype Pattern
[Figure: a haplotype matrix over the nucleotides A, C, G, T, shown next to the same matrix encoded with 0 and 1.]
At each SNP site label the two alleles as 0 and 1.
The choice which allele is 0 and which one is 1
is arbitrary.
Recombination
G T T C G A C A A C A T
A C G T A T C T A T T A
G T T C G A C T A T T A
Recombination
The two alleles are linked, i.e., they are “traveling together”:
G T T C G A C A A C A T
A C G T A T C T A T T A
Recombination disrupts the linkage:
G T T C G A C T A T T A
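The recombinant sequence above can be reproduced with a single crossover. This is a minimal sketch: the function name recombine and the breakpoint convention (number of leading sites taken from the first parent) are assumptions made for illustration.

```python
def recombine(parent_a, parent_b, breakpoint):
    """Single crossover: sites before `breakpoint` come from parent_a, the rest from parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same number of sites")
    if not 0 <= breakpoint <= len(parent_a):
        raise ValueError("breakpoint out of range")
    return parent_a[:breakpoint] + parent_b[breakpoint:]


parent_a = "GTTCGACAACAT"
parent_b = "ACGTATCTATTA"
# Both parents carry C at site 7, so a crossover just before or just after it gives the same child.
child = recombine(parent_a, parent_b, 7)
print(child)  # GTTCGACTATTA
```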
Linkage Disequilibrium (LD)
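Emergence of Variations Over Time
[Figure: variations, including a disease mutation, arising on lineages from a common ancestor to the present; axis: time.]
Variations in Chromosomes Within a Population
Extent of Linkage Disequilibrium
[Figure: chromosomes carrying a disease-causing mutation, shown 2,000 generations ago, 1,000 generations ago, and at present.]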
A Data Compression Problem
The Minimum Informative Subset
A Data Compression Problem
Select SNPs to use in an association study
    Would like to associate single nucleotide polymorphisms (SNPs) with disease.
Very large number of candidate SNPs
    Chromosome-wide studies, whole-genome scans
    For cost effectiveness, select only a subset.
Closely spaced SNPs are highly correlated
    It is less likely that there has been a recombination between two SNPs if they are close to each other.
Disease Associations
Association studies
[Figure: individuals grouped as Disease (Responder) and Control (Non-responder), each shown with allele 0 or allele 1 at Marker A.]
Marker A is associated with Phenotype
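The slides do not name a statistic for deciding that a marker is associated with a phenotype. One common choice, used here purely as an assumption, is a chi-square test on the 2x2 table of allele counts in the two groups. The function name and the counts below are hypothetical.

```python
import math


def allele_association(case_0, case_1, control_0, control_1):
    """Pearson chi-square (1 degree of freedom) for a 2x2 table of allele counts.

    Returns (chi2, p_value). Assumes every row and column total is non-zero.
    """
    n = case_0 + case_1 + control_0 + control_1
    denom = ((case_0 + case_1) * (control_0 + control_1)
             * (case_0 + control_0) * (case_1 + control_1))
    if denom == 0:
        raise ValueError("every row and column of the table needs a non-zero total")
    chi2 = n * (case_0 * control_1 - case_1 * control_0) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: allele 0 is enriched in the disease group.
chi2, p_value = allele_association(case_0=30, case_1=10, control_0=10, control_1=30)
print(chi2, p_value)
```

In a genome-wide scan the same test would be run once per selected SNP, which is one reason the number of SNPs typed matters.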
Association studies

Evaluate whether nucleotide
polymorphisms associate with
phenotype
A C G A G
A C G A T
A T A A G
C T A G T
A T G G T
A T G G G
Association studies
0 0 0 0 0
0 0 0 0 1
0 1 1 0 0
1 1 1 1 1
0 1 0 1 1
0 1 0 1 0
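A sketch of the 0/1 encoding used on these slides, applied to the six haplotypes of the association-study example. Since the choice of which allele is 0 is arbitrary, this sketch labels the allele carried by the first haplotype as 0; that convention and the function name encode_binary are assumptions, although they happen to reproduce the 0/1 matrix shown.

```python
def encode_binary(haplotypes):
    """Encode biallelic haplotypes as 0/1, with the first haplotype's allele labeled 0 at each site."""
    n_sites = len(haplotypes[0])
    for site in range(n_sites):
        if len({hap[site] for hap in haplotypes}) > 2:
            raise ValueError(f"site {site} has more than two alleles")
    reference = haplotypes[0]
    return [[0 if allele == ref else 1 for allele, ref in zip(hap, reference)]
            for hap in haplotypes]


haplotypes = ["ACGAG", "ACGAT", "ATAAG", "CTAGT", "ATGGT", "ATGGG"]
encoded = encode_binary(haplotypes)
for row in encoded:
    print(*row)
```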
Compression based on
Haplotype Resolution
D-graph of a SNP
For a SNP s we associate a bipartite graph.
Nodes: the set of haplotypes.
Edges: the set of pairs of haplotypes with different alleles at s.
[Figure: a 0/1 haplotype matrix with the SNPs s1 and s2 marked.]
D-graph of a set of SNPs
For a set of SNPs S we associate a graph, the union of the D-graphs of the SNPs in S.
Nodes: the set of haplotypes.
Edges: the set of pairs of haplotypes with different alleles at some SNP s in S.
[Figure: the same 0/1 haplotype matrix with the SNPs s1 and s2 marked.]
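A minimal sketch of the D-graph definitions above. Haplotypes are rows of a 0/1 matrix, nodes are row indices, and an edge joins two rows that differ at some SNP of the chosen set. The matrix and the function name d_graph are illustrative assumptions, not data from these slides' figures.

```python
from itertools import combinations


def d_graph(haplotypes, snps):
    """Edges of the D-graph: pairs (i, j), i < j, of haplotypes differing at some SNP in `snps`."""
    edges = set()
    for i, j in combinations(range(len(haplotypes)), 2):
        if any(haplotypes[i][s] != haplotypes[j][s] for s in snps):
            edges.add((i, j))
    return edges


# Illustrative matrix: 4 haplotypes (rows) by 5 SNPs (columns, 0-based).
H = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
]
single = d_graph(H, [0])      # D-graph of one SNP
pair = d_graph(H, [0, 1])     # D-graph of a set of two SNPs
print(sorted(single))
print(sorted(pair))
```

The D-graph of a set is the union of the single-SNP D-graphs, so adding SNPs can only add edges.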
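SNP Selection
Red SNP is equivalent to Blue SNP
[Figure: the 0/1 haplotype matrix with one SNP column highlighted in red and one in blue.]
SNP Selection
Red SNPs predict Green SNP
[Figure: the 0/1 haplotype matrix with SNP columns highlighted in red and one in green.]
Data Compression
Minimal Informative Subset
[Figure: the 0/1 haplotype matrix with the selected SNP columns highlighted.]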
Compression based on
Haplotype Blocks
Hypothesis – Haplotype Blocks?
The genome consists largely of blocks of common SNPs with relatively little recombination within the blocks.
Patil et al., Science, 2001;
Jeffreys et al., Nature Genetics, 2001;
Daly et al., Nature Genetics, 2001
Haplotype Block Structure
LD-Blocks, and 4-Gamete Test Blocks
[Figure: a 200 kb stretch of DNA annotated with sense genes, antisense genes, SNPs, and haplotype blocks 1 to 4.]
Four Gamete Block Test (Hudson and Kaplan 1985)
A segment of SNPs is a block if, between every pair of SNPs, at most 3 of the 4 gametes (00, 01, 10, 11) are observed.
[Figure: example SNP columns that pass the test, labeled BLOCK.]
[Figure: example SNP columns that fail the test, labeled VIOLATES THE BLOCK DEFINITION.]
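A sketch of the four-gamete block test as stated above, assuming biallelic SNPs already encoded as 0/1. The function names and the two small examples are hypothetical; they are not the matrices from the slide's figures.

```python
from itertools import combinations


def four_gametes_present(col_a, col_b):
    """True if all four gametes 00, 01, 10, 11 occur across the haplotypes (biallelic sites assumed)."""
    return len(set(zip(col_a, col_b))) == 4


def is_block(haplotypes):
    """A segment is a block if no pair of its SNPs shows all four gametes."""
    n_sites = len(haplotypes[0])
    columns = [[hap[s] for hap in haplotypes] for s in range(n_sites)]
    return not any(four_gametes_present(columns[a], columns[b])
                   for a, b in combinations(range(n_sites), 2))


# Hypothetical examples: rows are haplotypes, characters are 0/1 alleles.
passes = ["00", "11", "01", "11"]    # gametes seen: 00, 11, 01
fails = ["00", "01", "10", "11"]     # all four gametes seen
print(is_block(passes), is_block(fails))  # True False
```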
Finding Recombination Hotspots:
Many Possible Partitions into Blocks
A C T A G A T A G C C T
G T T C G A C A A C A T
A C T C T A T G A T C G
G T T A T A C G A C A T
A C T C T A T A G T A T
A C T A G C T G G C A T
All four gametes are present between some pairs of sites, for example sites 7 and 8 (T A, C A, T G, C G).
Find the left-most right endpoint of any constraint and mark the site before it a recombination site.
Eliminate any constraints crossing that site.
Repeat until all constraints are gone.
The final result is a minimum-size set of sites crossing all constraints.
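The greedy procedure above can be sketched as follows. It assumes that a constraint is a pair of sites (left, right), left < right, that shows all four gametes, so that at least one recombination must be placed somewhere between them. The function name and the breakpoint convention (b means "between site b-1 and site b") are illustrative choices, and the constraints in the example are hypothetical.

```python
def min_recombination_sites(constraints):
    """Greedy choice of breakpoints so that every constraint interval is crossed.

    constraints: iterable of (left, right) site indices with left < right.
    A breakpoint b lies between site b-1 and site b and crosses (left, right)
    when left < b <= right.
    """
    breakpoints = []
    # Visiting constraints by right endpoint means the next uncrossed one
    # always has the left-most right endpoint among those remaining.
    for left, right in sorted(constraints, key=lambda c: c[1]):
        if breakpoints and left < breakpoints[-1] <= right:
            continue  # already crossed by the most recently placed breakpoint
        breakpoints.append(right)  # place the breakpoint just before site `right`
    return breakpoints


# Hypothetical constraints on 7 sites (0-based indices).
result = min_recombination_sites([(0, 3), (2, 5), (4, 6)])
print(result)  # [3, 6]
```

Checking only the last breakpoint is enough because breakpoints are placed in increasing order and never lie beyond the right endpoint of the constraint being examined.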
Data Compression
ACGATCGATCATGAT
GGTGATTGCATCGAT
ACGATCGGGCTTCCG
ACGATCGGCATCCCG
GGTGATTATCATGAT
A------A---TG--
G------G---CG--
A------G---TC--
A------G---CC--
G------A---TG--
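A sketch reproducing the compressed view above: only positions 1, 8, 12 and 13 (1-based) of each haplotype are kept, and the check confirms that these four sites still tell the five haplotypes apart. The helper name mask is hypothetical, and the sketch does not show that this is the smallest such set.

```python
haplotypes = [
    "ACGATCGATCATGAT",
    "GGTGATTGCATCGAT",
    "ACGATCGGGCTTCCG",
    "ACGATCGGCATCCCG",
    "GGTGATTATCATGAT",
]
tag_sites = [0, 7, 11, 12]  # 0-based indices of the retained positions


def mask(haplotype, keep):
    """Replace every site outside `keep` with '-'."""
    return "".join(c if i in keep else "-" for i, c in enumerate(haplotype))


masked = [mask(h, set(tag_sites)) for h in haplotypes]
projected = ["".join(h[i] for i in tag_sites) for h in haplotypes]
for row in masked:
    print(row)
# The retained sites still distinguish all five haplotypes.
print(len(set(projected)) == len(haplotypes))  # True
```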
Selecting Tagging SNPs in Blocks
Haplotype Blocks based on LD
(Method of Gabriel et al. 2002)
A New Measure
Informativeness
Informativeness
[Figure: two haplotypes, h1 and h2, with a SNP s marked.]
h1: 0 0 1 1 0
h2: 0 0 1 0 1
Informativeness
s1 s2 s3 s4 s5
 0  0  1  1  0
 0  0  1  0  1
 0  1  0  0  0
 1  1  0  1  1
I(s1, s2) = 2/4 = 1/2
I({s1,s2}, s4) = 3/4
I({s3,s4}, {s1,s2,s5}) = 3
S = {s3,s4} is a Minimal Informative Subset
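The definition of informativeness did not survive in the text of these slides, so the sketch below rests on an assumption: I(S, t) is the fraction of edges in the D-graph of the target SNP t that are also edges in the D-graph of the set S, and the value for a set of targets is the sum over its members. Under that assumption the three values quoted above are reproduced. Function names are hypothetical; SNPs s1 to s5 are columns 0 to 4.

```python
from fractions import Fraction
from itertools import combinations

H = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
]


def d_graph(haplotypes, snps):
    """Pairs of haplotypes that differ at some SNP in `snps`."""
    return {(i, j) for i, j in combinations(range(len(haplotypes)), 2)
            if any(haplotypes[i][s] != haplotypes[j][s] for s in snps)}


def informativeness(haplotypes, predictors, target):
    """Assumed definition: share of the target's D-graph edges covered by the predictors."""
    target_edges = d_graph(haplotypes, [target])
    if not target_edges:
        raise ValueError("target site is not polymorphic in these haplotypes")
    covered = target_edges & d_graph(haplotypes, predictors)
    return Fraction(len(covered), len(target_edges))


def set_informativeness(haplotypes, predictors, targets):
    """Assumed definition: sum of the informativeness over all target SNPs."""
    return sum(informativeness(haplotypes, predictors, t) for t in targets)


print(informativeness(H, [0], 1))                 # I(s1, s2) = 1/2
print(informativeness(H, [0, 1], 3))              # I({s1,s2}, s4) = 3/4
print(set_informativeness(H, [2, 3], [0, 1, 4]))  # I({s3,s4}, {s1,s2,s5}) = 3
```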
Informativeness
Graph theory insight
Minimum Set Cover = Minimum Informative Subset
[Figure: the haplotype matrix over SNPs s1 to s5 next to a bipartite graph joining each SNP to the D-graph edges e1 to e6 that it covers.]
Minimum Set Cover {s3, s4}
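A brute-force sketch of the set-cover view, assuming the 4-haplotype, 5-SNP matrix from the informativeness example: each SNP covers the haplotype pairs it distinguishes, and a minimum informative subset is a smallest set of SNPs covering every pair that the full SNP set distinguishes. The function name is hypothetical, and exhaustive search is only practical for tiny inputs because minimum set cover is NP-hard in general.

```python
from itertools import combinations

H = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
]


def minimum_informative_subsets(haplotypes):
    """Return every smallest set of SNP columns covering all distinguishable haplotype pairs."""
    n_sites = len(haplotypes[0])
    pairs = list(combinations(range(len(haplotypes)), 2))
    covers = {s: {(i, j) for i, j in pairs if haplotypes[i][s] != haplotypes[j][s]}
              for s in range(n_sites)}
    universe = set().union(*covers.values())
    for size in range(n_sites + 1):
        found = [subset for subset in combinations(range(n_sites), size)
                 if set().union(*(covers[s] for s in subset)) == universe]
        if found:
            return found
    return []


solutions = minimum_informative_subsets(H)
print(solutions)
# {s3, s4} is columns (2, 3) in 0-based indexing.
print((2, 3) in solutions)  # True
```

On this matrix the minimum is not unique: several pairs of SNPs cover all six edges, and {s3, s4} is one of them.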
Real Haplotype Data
A region of Chr. 22
45 Caucasian samples
Two different runs of the Gabriel et al. block detection method +
Zhang et al. SNP selection algorithm
Our block-free algorithm
When Maximum Likelihood = Bayesian = Parsimony
[Figures: a series of panels with one axis numbered 1 to 30 and the other numbered 1 to 14 or 1 to 15; nucleotide legend: A, C, G, T.]