Single Nucleotide Polymorphism
Linkage Disequilibrium And
Haplotypes
Xiaole Shirley Liu
Outline
• Definition and motivation
• SNP distribution and characteristics
– Allele frequency, LD, population stratification
• SNP and genotyping
• Haplotype inference:
– Clark’s algorithm
– EM and Gibbs sampling
– Hapmap project and 1000 Genomes
STAT115
Polymorphism
• Polymorphism: sites/genes with “common”
variation, i.e. the less common allele has frequency >= 1%;
otherwise the variant is called rare and not polymorphic
• Single Nucleotide Polymorphism
– Comes from a DNA-replication mistake in an
individual germ line cell, then transmitted
– ~90% of human genetic variation
• Copy number variations
– May or may not be genetic
Why Should We Care
• Disease gene discovery
– Association studies, e.g. certain SNP alleles
confer susceptibility to diabetes
– Chromosome aberrations, duplication / deletion
might cause cancer
• Personalized Medicine
– Drug only effective if you have one allele
SNP Distribution
• Most common, 1 SNP / 100-300 bp
– Balance between the rate at which mutations are introduced
and the rate at which polymorphisms are lost
– Most mutations are lost within a few generations
• 2/3 are C/T differences
• In non-coding regions, often fewer SNPs at
more conserved regions
• In coding regions, often more synonymous
than non-synonymous SNPs
SNP Characteristics:
Allele Frequency Distribution
• Most alleles are rare (minor allele frequency
< 10%)
SNP Characteristics:
Linkage Disequilibrium
• Hardy-Weinberg equilibrium
– In a population with genotypes AA, aa, and Aa, if p =
freq(A) and q = freq(a), the frequencies of AA, aa, and Aa
will be $p^2$, $q^2$, and $2pq$ respectively at equilibrium.
– Similarly with two loci, each with two alleles (A/a and B/b):
at equilibrium, freq(AB) = freq(A) × freq(B)
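The equilibrium frequencies can be checked numerically. The sketch below is only illustrative: the function names are made up for this example, and the LD measures D and r² are standard textbook quantities that the slides themselves do not define.

```python
def hwe_genotype_freqs(p):
    """Expected frequencies of (AA, Aa, aa) at Hardy-Weinberg equilibrium,
    where p = freq(A) and q = 1 - p = freq(a)."""
    q = 1.0 - p
    return p * p, 2 * p * q, q * q


def ld_coefficient(p_ab, p_a, p_b):
    """D = freq(AB) - freq(A) * freq(B). D is 0 at linkage equilibrium."""
    return p_ab - p_a * p_b


def r_squared(p_ab, p_a, p_b):
    """Squared correlation between the two loci (assumes 0 < p_a, p_b < 1)."""
    d = ld_coefficient(p_ab, p_a, p_b)
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


freq_AA, freq_Aa, freq_aa = hwe_genotype_freqs(0.7)
print(freq_AA, freq_Aa, freq_aa)
print(ld_coefficient(0.3, 0.6, 0.5))  # haplotype AB at its equilibrium frequency
print(r_squared(0.5, 0.5, 0.5))       # A always travels with B
```

The three genotype frequencies always sum to 1, which is a quick sanity check on any allele frequency passed in.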
SNP Characteristics:
Linkage Disequilibrium
• [Figure: two panels, “Equilibrium” and “Disequilibrium”]
• LD: alleles at two loci occur together more often than can
be accounted for by chance, which usually indicates that the two
loci are physically close on the DNA
– In mammals, LD is often lost at ~100 kb
– In fly, LD often decays within a few hundred
bases
SNP Characteristics:
Linkage Disequilibrium
• Statistical Significance of LD
– Chi-square test (or Fisher’s exact test)
– $\chi^2 = \sum_{i,j} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$, where $e_{ij} = n_{i.}\, n_{.j} / n_T$

        B1     B2     Total
A1      n11    n12    n1.
A2      n21    n22    n2.
Total   n.1    n.2    nT
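As a rough illustration of the chi-square test on the 2x2 table of allele counts, the sketch below computes the statistic from the four cell counts. The function name is hypothetical, all row and column totals are assumed to be non-zero, and the p-value uses the fact that a chi-square variable with 1 degree of freedom is a squared standard normal.

```python
import math


def chi_square_2x2(n11, n12, n21, n22):
    """Chi-square statistic and p-value (1 degree of freedom) for a 2x2 table.

    Rows are alleles A1/A2 at one locus, columns are alleles B1/B2 at the other.
    Assumes every row total and column total is greater than zero.
    """
    observed = [[n11, n12], [n21, n22]]
    row_totals = [n11 + n12, n21 + n22]
    col_totals = [n11 + n21, n12 + n22]
    n_total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n_total  # e_ij
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Survival function of chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


chi2, p = chi_square_2x2(40, 10, 10, 40)
print(chi2, p)
```

For small counts the chi-square approximation is poor, which is why Fisher’s exact test is listed as the alternative.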
SNP Characteristics:
Linkage Disequilibrium
• Haplotype block: a cluster of linked SNPs
• Haplotype boundary: sequence falls into blocks
with strong LD within blocks and no LD
between blocks; the boundaries reflect recombination
hotspots
• Haplotype size
distribution
11
STAT115
SNP Characteristics:
Linkage Disequilibrium
• [C/T] [A/G] T X C [A/C] [T/A]
– Possible haplotypes: $2^4$ = 16
– In reality, a few common haplotypes explain 90% of the
variation
• Tagging SNPs:
– SNPs that capture most of the variation in haplotypes
– Removes redundancy
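The count of possible haplotypes for the example region can be reproduced by enumeration. This is a small sketch: the `sites` list is transcribed from the slide, and the `X` position is kept as written because the slide does not say what it stands for.

```python
from itertools import product

# One string per position; a two-letter string is a SNP with two alleles,
# a one-letter string is a position that does not vary.
sites = ["CT", "AG", "T", "X", "C", "AC", "TA"]

# Every combination of one allele per position is a possible haplotype.
haplotypes = ["".join(alleles) for alleles in product(*sites)]

n_snps = sum(1 for s in sites if len(s) > 1)
print(n_snps, len(haplotypes))
print(haplotypes[:4])
```

With four biallelic SNPs the list has 2^4 entries; in real data only a handful of these would be observed at appreciable frequency.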
SNP Genotyping
• One SNP at a time or genome-wide (SNP array)
40 Probes Used Per SNP
• Allele call
– AA, BB, AB
• Signal
– Theoretically 1A+1B, 2A, 2B
– But could have 1A+3B: amplified!
Haplotype
• Haplotype: cluster of SNPs with LD
– Block with 10 SNPs has $2^{10}$ possible haplotypes
– Only observe 5-6 haplotypes (> 90% of cases)
– Tagging SNPs: subset of SNPs that identifies a haplotype
• Association (with disease) studies using
haplotypes are more accurate than those using a single
SNP locus
• Haplotype inference: Aa BB Cc
Haplotype Inference
• Genotyping only tells us that an individual is e.g.
Aa BB Cc, but it doesn’t tell whether the
haplotypes are ABC + aBc, or ABc + aBC
• Haplotypes can often be inferred if the parental
genotypes are known
– Similar to blood typing, e.g. F: A, M: AB, C: B
→ F: AO, M: AB, C: BO
• Otherwise, look at the population
genotypes, infer common haplotypes
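The phase ambiguity can be made concrete by listing every haplotype pair that is consistent with an unphased genotype. The sketch below assumes a genotype is written as one two-character string per locus (a representation chosen here for illustration, not taken from the slides).

```python
from itertools import product


def compatible_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype.

    genotype: sequence of 2-character strings, one per locus, e.g. ["Aa", "BB", "Cc"].
    """
    het = [i for i, g in enumerate(genotype) if g[0] != g[1]]
    pairs = set()
    # The phase of the first heterozygous locus is fixed so that (h1, h2) and
    # (h2, h1) are not counted twice; every other heterozygous locus can flip.
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        h1, h2 = [], []
        for i, (x, y) in enumerate(genotype):
            if flip_at.get(i, False):
                x, y = y, x
            h1.append(x)
            h2.append(y)
        pairs.add(("".join(h1), "".join(h2)))
    return sorted(pairs)


print(compatible_pairs(["Aa", "BB", "Cc"]))
```

A genotype with k heterozygous loci has 2^(k-1) compatible pairs, so individuals with zero or one heterozygous locus are unambiguous.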
Haplotype Inference
Clark’s Algorithm
1. Construct haplotypes from unambiguous individuals
2. Remove samples that can be explained as combinations
of haplotypes discovered already
3. Propose the haplotype that would explain the most remaining samples
4. Iterate 2 & 3 until finished
• Disadvantages:
– Depends on the number of ambiguous subjects
– Cannot get started when n is small
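The four steps can be sketched in a few lines. This is one possible reading of the slide, not a reference implementation: the genotype encoding (one two-character string per locus), the function names, and the alphabetical tie-breaking rule are all assumptions made for the example.

```python
from itertools import product


def compatible_pairs(genotype):
    """Haplotype pairs consistent with a genotype given as 2-character strings per locus."""
    het = [i for i, g in enumerate(genotype) if g[0] != g[1]]
    pairs = []
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        h1 = "".join(g[1] if flip_at.get(i) else g[0] for i, g in enumerate(genotype))
        h2 = "".join(g[0] if flip_at.get(i) else g[1] for i, g in enumerate(genotype))
        pairs.append((h1, h2))
    return pairs


def clark(genotypes):
    """Return (known haplotypes, phased individuals, indices left unresolved)."""
    known, phased = set(), {}
    # Step 1: individuals with at most one heterozygous locus are unambiguous.
    for idx, g in enumerate(genotypes):
        pairs = compatible_pairs(g)
        if len(pairs) == 1:
            phased[idx] = pairs[0]
            known.update(pairs[0])
    remaining = [i for i in range(len(genotypes)) if i not in phased]
    while remaining:
        # Step 2: remove samples explained by two already-known haplotypes.
        for idx in list(remaining):
            for h1, h2 in compatible_pairs(genotypes[idx]):
                if h1 in known and h2 in known:
                    phased[idx] = (h1, h2)
                    remaining.remove(idx)
                    break
        # Step 3: each new haplotype that pairs with a known one gets a vote
        # from every remaining sample it would explain.
        votes = {}
        for idx in remaining:
            for h1, h2 in compatible_pairs(genotypes[idx]):
                if h1 in known and h2 not in known:
                    votes.setdefault(h2, set()).add(idx)
                elif h2 in known and h1 not in known:
                    votes.setdefault(h1, set()).add(idx)
        if not votes:
            break  # nothing left to propose (or nothing left to resolve)
        # Ties are broken alphabetically here; that choice is arbitrary.
        known.add(max(sorted(votes), key=lambda h: len(votes[h])))
    return known, phased, remaining


samples = [("AA", "BB", "CC"), ("Aa", "Bb", "CC"), ("aa", "Bb", "Cc")]
known, phased, unresolved = clark(samples)
print(sorted(known), phased, unresolved)
print(clark([("Aa", "Bb", "CC")]))  # no unambiguous individual: cannot get started
```

The second call shows the listed disadvantage: with no unambiguous individual there is no seed haplotype, so every sample stays unresolved.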
EM and Gibbs Sampling in Motif Finding
• Problem
– Observe: sequence S
– Unknown: motif θ and site location A (alignment), but
given one, can infer the other
• EM and Gibbs Sampler
– Initialize random motif θ
– Iterate:
• Given θ and sequence S, update site location A
• Given A and S, update θ
– EM updates by weighted average
– Gibbs sampling updates by sampling
Statistical Model for Haplotype
[Figure: a table of eight haplotypes (numbered 1-8) with their frequencies, and a haplotype pool listing the haplotype numbers drawn for individuals]
• Each individual’s two haplotypes are treated as random
draws from a pool of haplotypes with certain frequencies,
restricted to pairs that are consistent with the observed genotype
Haplotype Inference
EM and Gibbs Sampler
• Observe genotype Y, estimate haplotype pair Z for
each individual and haplotype frequency θ
• Initialize haplotype frequencies
• Iteration:
– Estimate Z given Y, θ
– Estimate θ given Y, Z
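A minimal EM sketch of this iteration is given below. Several things are assumptions made for the example: the symbol `theta` for the haplotype frequencies, the genotype encoding (one two-character string per locus), the uniform initialization, and the fixed number of iterations. A Gibbs sampler would follow the same loop but sample a pair Z for each individual instead of using the weighted average.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Haplotype pairs consistent with a genotype given as 2-character strings per locus."""
    het = [i for i, g in enumerate(genotype) if g[0] != g[1]]
    pairs = []
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        h1 = "".join(g[1] if flip_at.get(i) else g[0] for i, g in enumerate(genotype))
        h2 = "".join(g[0] if flip_at.get(i) else g[1] for i, g in enumerate(genotype))
        pairs.append((h1, h2))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies theta from unphased genotypes Y."""
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haplotypes = sorted({h for pairs in pair_lists for pair in pairs for h in pair})
    theta = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    n_chromosomes = 2 * len(genotypes)

    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in pair_lists:
            # E-step: P(Z = (h1, h2) | Y, theta) is proportional to theta[h1] * theta[h2].
            # The factor 2 for h1 != h2 is shared by all pairs of one individual
            # and cancels in the normalization.
            weights = [theta[h1] * theta[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: new frequency = expected haplotype count / number of chromosomes.
        theta = {h: counts[h] / n_chromosomes for h in haplotypes}
    return theta


# Four AB/AB individuals, four ab/ab individuals, and one ambiguous double heterozygote.
data = [("AA", "BB")] * 4 + [("aa", "bb")] * 4 + [("Aa", "Bb")]
theta = em_haplotype_frequencies(data)
print(theta)
```

In this toy data the ambiguous individual ends up explained almost entirely as AB + ab, because those two haplotypes are already common in the rest of the sample.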
Haplotype Inference
Partition-Ligation
• When #SNP is big, # possible haplotypes is too
big, so divide and conquer
– Consider an inferred sub-haplotype as one allele
Hapmap of Human Genome
• HapMap: catalog of common genetic variants in
humans
– What are these variants
– Where do they occur in our DNA
– How are they distributed within populations and
between populations around the world
• Goals:
– Define haplotype “blocks” across the genome
– Identify reference set of SNPs: “tag” each haplotype
– Enable unbiased, genome-wide association studies
1000 Genomes Project
• Characterization of human genome
sequence variation
• Foundation for investigating the
relationship between genotype and
phenotype
Summary
• SNP and CNV
• SNP distribution and characteristics
– Allele frequency (minor allele > 1%)
– LD: linkage ~ physical proximity
– Population stratification
• SNP genotyping: SNP arrays, sequencing
• Haplotype inference
– Clark’s: resolve unambiguous individuals first, then propose new
haplotypes to maximize explanation
– EM & Gibbs: iteratively infer haplotype frequency and
individuals’ haplotypes
Acknowledgement
• Stefano Monti
• Jun Liu & Tim Niu
• Kenneth Kidd, Judith Kidd and Glenys
Thomson
• Joel Hirschhorn
• Greg Gibson & Spencer Muse
• Cheng Li & Yuhyun Park