CSE291: Personal genomics for bioinformaticians
Class meetings: TR 3:30-4:50 MCGIL 2315
Office hours: M 3:00-5:00, W 4:00-5:00 CSE 4216
Contact: [email protected]
Today’s schedule:
• 3:30-4:15 Intro to population genetics
• 4:15-4:20 Break
• 4:20-4:50 Overview of PS2, Journal discussion
Announcements:
• PS1 due tomorrow. Any issues with XSEDE?
• PS2 is out. Due Jan. 26.
• Readings for Lecture #3 posted (short)
Intro to population genetics
CSE291: Personal Genomics for
Bioinformaticians
01/12/17
Outline
• Patterns of genetic variation
• Linkage disequilibrium
• Discuss problem set 2
• Journal club: Positive selection
Patterns of genetic variation
Four forces driving evolution
• Mutation: introduces genetic variation
• Genetic drift: Evolution is inherently a
stochastic process
• Natural selection: Advantageous mutations
more likely to spread
• Gene flow: exchange of genetic material
between populations
“Nothing in biology makes sense except in the light of evolution”
– Theodosius Dobzhansky (1973)
You share genes with me
• Identical twins: 0 differences
• Unrelated humans: 1/1,000 bases differ
• Human vs. chimp: 1/100 bases differ
[Figure: map of sampled populations by region: Africa, West Eurasia, South Asia, America, East Asia/Oceania. Mallick, et al. 2016]
Mutation: the fuel for evolution
TCATTTGAGCATTAAATGTCAAGTTCTGCAC
TCATTTGAGCATTAAGTGTCAAGTTCTGCAC
rs12913832 (e.g. eye color mutation)
• Human mutation rate ~$1.5 \times 10^{-8}$ mutations/bp/generation.
• ~50 new mutations per generation
• Classes of mutations:
• Single nucleotide polymorphisms (SNPs)
• Insertions/deletions (indels)
• Structural variants (SVs), copy number variants (CNVs)
Roach et al. 2010; Conrad et al. 2011
Mutation: a not-so-random process
• Per-generation mutation rate highly dependent on
paternal age
• Local sequence features influence local mutation rates
across the genome. Most important: trinucleotide context
(i.e. the bases immediately before and after a given position)
Michaelson, et al. 2012
The language of genetic variation
Chromosome
Diploid: contains two complete sets of chromosomes, one from each parent
Haploid: contains one complete set of chromosomes
Locus: a position in the genome, e.g. chr1:1000 (shown within the sequence TTAAATGTC)
Allele: one of two or more alternative versions of a specific locus due to mutation (here A, G)
Genotype: specifies the allele on each chromosome copy (here AA, AG, GG)
More terminology about SNPs
Example: eight allele copies sampled at one position: C, C, C, C, A, C, C, A
Reference allele: the nucleotide at this position in the reference genome
Alternate allele: an allele in the population that does not match the human reference
Major allele: most common allele for a given position. In this example, C. Note that the major allele is not always the reference allele.
Minor allele: any allele besides the major allele. In this example, A
Minor allele frequency (MAF): frequency of the minor allele. Here MAF = 2/8 = 0.25
rsid: a unique identifier for a given SNP, as used by dbSNP
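The terminology above can be checked with a few lines of Python. This is a minimal sketch using the eight alleles from the example; the function names are made up for illustration and it assumes a biallelic site.

```python
from collections import Counter

def allele_frequencies(alleles):
    """Return {allele: frequency} for a list of observed alleles."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def major_and_minor(alleles):
    """Return (major_allele, minor_allele, maf), assuming a biallelic site."""
    freqs = allele_frequencies(alleles)
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    (major, _), (minor, maf) = ranked[0], ranked[1]
    return major, minor, maf

# The eight allele copies from the slide example.
observed = ["C", "C", "C", "C", "A", "C", "C", "A"]
major, minor, maf = major_and_minor(observed)
print(major, minor, maf)  # C A 0.25
```

Note that the reference allele cannot be derived from the sample; it has to be looked up in the reference genome.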
Genetic drift
Each generation: choose n alleles with replacement
Allele freq.
Gen. 1: pred=0.50; pblue=0.5
Gen. 2: pred=0.35; pblue=0.6
Gen. 3: pred=0.40; pblue=0.6
Gen. 4: pred=0.30; pblue=0.7
Gen. 5: pred=0.20; pblue=0.8
…
Fixation: pred=0.00; pblue=1.0
Wright-Fisher model
N individuals
2N alleles in a diploid population
2 alleles: A, B
p frequency of allele A
q frequency of allele B
Assumptions:
• Non-overlapping generations
• Constant population size
• New generation drawn at random
Probability that next generation has k copies of allele A:
$P(k) = \binom{2N}{k} p^k q^{2N-k}$
(this is taken directly from the binomial distribution)
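The binomial expression above is easy to evaluate directly. The sketch below uses only the standard library; the function name and the choice of N = 10, p = 0.5 are illustrative assumptions.

```python
from math import comb

def wf_transition_prob(k, n_individuals, p):
    """P(next generation has k copies of A) under Wright-Fisher sampling."""
    two_n = 2 * n_individuals
    q = 1.0 - p
    return comb(two_n, k) * p**k * q**(two_n - k)

n_individuals = 10   # illustrative population size (2N = 20 alleles)
p = 0.5              # current frequency of allele A
probs = [wf_transition_prob(k, n_individuals, p)
         for k in range(2 * n_individuals + 1)]
print(sum(probs))    # the probabilities over all k sum to 1
print(probs[10])     # chance the frequency stays exactly at 0.5
```

Even starting at p = 0.5, the chance of landing on exactly 10 copies again is well below one, which is the source of drift.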
• In a finite population, what eventually happens to allele frequencies? Fixation or loss
• How is time to fixation related to population size? Grows with population size
Time to fixation: $T_{fixed} = -\frac{4N(1-p)\ln(1-p)}{p}$
≈ 4N in a large population with p = 1/2N
• What is the probability of fixation of an allele due to drift? Initial frequency p
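A small simulation makes these answers concrete. This is a sketch under the Wright-Fisher assumptions listed above (constant size, non-overlapping generations, random sampling); the function names, the population size of 20, and the random seed are illustrative choices, not part of the lecture.

```python
import math
import random

def simulate_drift(n_individuals, p0, rng, max_generations=100_000):
    """Track the frequency of allele A until fixation or loss."""
    two_n = 2 * n_individuals
    count = round(p0 * two_n)
    trajectory = [count / two_n]
    for _ in range(max_generations):
        if count == 0 or count == two_n:
            break
        p = count / two_n
        # Draw 2N alleles with replacement from the current generation.
        count = sum(1 for _ in range(two_n) if rng.random() < p)
        trajectory.append(count / two_n)
    return trajectory

def expected_time_to_fixation(n_individuals, p):
    """T_fixed = -4N(1-p)ln(1-p)/p, in generations."""
    return -4 * n_individuals * (1 - p) * math.log(1 - p) / p

rng = random.Random(0)
runs = [simulate_drift(20, 0.5, rng) for _ in range(200)]
fixed = sum(1 for t in runs if t[-1] == 1.0)
print(fixed / len(runs))                           # should be near p0 = 0.5
print(expected_time_to_fixation(1000, 1 / 2000))   # close to 4N = 4000
```

Every run ends at frequency 0 or 1, and the fraction ending at 1 approximates the starting frequency.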
Example: Effect of bottlenecks
Ancestral population: pred= 7.4% pgreen=7.4% pblue=85.2%
Bottleneck
Founder population: pred= 25% pgreen=0% pblue=75%
Bottlenecks have dramatic effects on frequencies of rare alleles
Example bottleneck populations:
• Early American settlers
• Several thousand French Canadian settlers. High rate of Leigh Syndrome,
Tyrosinemia Type I
• Ashkenazi Jewish
• Descend from several hundred people. High rate of e.g. Gaucher, Tay-Sachs,
CF
• Finland
• Bottleneck 4000 yrs ago. 36 “Finnish heritage diseases” e.g. Usher Syndrome
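One way to see why bottlenecks hit rare alleles hardest: the chance that an allele at frequency f is missing from n randomly drawn founder alleles is (1-f)^n. The sketch below applies this to the ancestral frequencies from the example; treating founders as independent random draws, and the founder sizes tried, are simplifying assumptions.

```python
def prob_allele_lost(freq, n_founder_alleles):
    """Chance that none of the sampled founder alleles carries the variant."""
    return (1.0 - freq) ** n_founder_alleles

# Ancestral frequencies from the bottleneck example.
ancestral = {"red": 0.074, "green": 0.074, "blue": 0.852}

# Hypothetical founder population sizes (number of allele copies).
for n in (4, 20, 100):
    print(n, round(prob_allele_lost(ancestral["green"], n), 3))
```

With only a handful of founder alleles, a 7.4% allele is more likely lost than kept; if it does survive, its frequency is necessarily much higher than before.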
Effective population size
• In reality, population size is not constant!
• “Effective population size” captures what the size would have to be for the
population to behave like an idealized population
• Larger effective population size -> more genetic diversity
• Recent population explosion -> tons of new rare alleles.
https://en.wikipedia.org/wiki/World_population#/media/File:Population_curve.svg
Effective population size – differs by ancestry
McEvoy et al. 2011
Hardy Weinberg Equilibrium
Hardy-Weinberg principle: allele and genotype frequencies remain constant in the
population from generation to generation in the absence of other evolutionary forces
Alleles: A, B
f(A) = p
f(B) = q = 1-p
Genotypes under HWE:
                Male A (p)     Male B (q)
Female A (p)    AA ($p^2$)     AB ($pq$)
Female B (q)    AB ($qp$)      BB ($q^2$)
$f(AA) = p^2$
$f(AB) = 2pq$
$f(BB) = q^2$
$f(A) = f(AA) + 0.5 f(AB) = p^2 + pq = p(p+q) = p$
$f(B) = f(BB) + 0.5 f(AB) = q^2 + pq = q(q+p) = q$
Assumptions:
• Diploid
• Sexual reproduction
• Random mating
• Large population size
• No selection, mutation, migration
Useful as a quality metric for SNP calling
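As a sketch of how HWE can serve as a quality metric, the code below compares observed genotype counts with the counts expected from the allele frequencies. The chi-square statistic is one common choice and is an assumption here, not something specified in the lecture; function names and example counts are illustrative.

```python
def hwe_expected_counts(n_aa, n_ab, n_bb):
    """Expected genotype counts under HWE given observed counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # f(A) = f(AA) + 0.5 f(AB)
    q = 1.0 - p
    return p * p * n, 2 * p * q * n, q * q * n

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing observed and HWE-expected counts."""
    observed = (n_aa, n_ab, n_bb)
    expected = hwe_expected_counts(n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

print(hwe_chi_square(25, 50, 25))   # counts in exact HWE proportions
print(hwe_chi_square(50, 0, 50))    # no heterozygotes at all
```

A site with a very large statistic (for example, one with no heterozygotes despite both alleles being common) is a candidate genotyping artifact, although real departures from the HWE assumptions can also cause it.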
Natural selection
https://9sc4evolution.wikispaces.com/4.+Natural+Selection
Natural selection
Selection coefficient s: relative fitness of red vs. blue
$P(\text{red sampled}) = \frac{p_{red}(1+s)}{p_{red}(1+s) + p_{blue}}$
s<0: red is deleterious
s>0: red is beneficial
s=0: standard Wright-Fisher, no selection
e.g. with s>>0, fixation happens quickly
Gen. 1: pred=0.50; pblue=0.50
Gen. 2: pred=0.60; pblue=0.40
Gen. 3: pred=0.80; pblue=0.20
…
Fixation: pred=1.00; pblue=0.00
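The sampling probability with selection can be iterated to see how quickly a favored allele rises. This is a deterministic sketch (it ignores drift, i.e. it assumes a very large population); the function names and the value s = 0.5 are illustrative.

```python
def prob_red_sampled(p_red, s):
    """Sampling probability of red when red has relative fitness 1 + s."""
    p_blue = 1.0 - p_red
    return p_red * (1 + s) / (p_red * (1 + s) + p_blue)

def expected_trajectory(p_red, s, generations):
    """Expected frequency of red over time, ignoring drift."""
    freqs = [p_red]
    for _ in range(generations):
        p_red = prob_red_sampled(p_red, s)
        freqs.append(p_red)
    return freqs

print(prob_red_sampled(0.5, 0.0))   # s = 0 leaves the frequency unchanged
print([round(f, 3) for f in expected_trajectory(0.5, 0.5, 5)])
```

To combine selection with drift, this probability would replace p in the binomial sampling step of the Wright-Fisher model.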
Patterns of selection
[Figure: three panels plotting fitness against genotype (AA, AB, BB)]
• Directional. Example: lactose tolerance
• Balancing. Example: sickle-cell anemia
• Disruptive. Example: ? butterfly colors
Example directional selection: lactose tolerance
• Most humans can’t digest lactose past childhood.
• A SNP in the gene LCT (C/T-13,910) is associated with the ability to digest milk in Europeans (autosomal dominant)
• Allele frequency decreases North to South in Europe, thought to be associated with strong selective advantage in dairy farmers
• An independent mutation in the same gene in African populations is also associated with lactase persistence.
On 23andMe: rs4988235 T/A near LCT gene. (I am AA=tolerate lactose)
Example balancing selection: sickle cell anemia
Observation (Haldane): red blood cell disorders (e.g. sickle-cell
anemia, thalassemias) common in areas where malaria is endemic.
Sickle cell anemia: due to mutations in beta hemoglobin (HBB),
patients experience infections, pain, and fatigue
Homozygous wildtype: normal
Heterozygous: “sickle cell trait”, resistant to malaria!
Homozygous mutant allele: affected by sickle cell anemia
https://prezi.com/5yynhkp0ffnj/sickle-cell-anemia/
On 23andMe: i3003137 (rs334) T/A in gene HBB. (I am TT=homozygous wildtype)
More examples: skin color
World distribution of A111T polymorphism in
SLC24A5
• A111T mutation in SLC24A5 associated with lighter skin
pigmentation in Europeans
• Became predominant in Europe ~10-20K years ago
• Hypothesis: selection for derived allele based on need
for sunlight to produce vitamin D
• One of the strongest signals of positive selection in
Europeans
On 23andMe: rs1426654 A/G in SLC24A5. (I am AA=light skin)
Canfield et al. 2014
Sabeti et al. 2007
Allele frequency vs. fitness effect
[Figure: effect size (mild to severe) plotted against allele frequency (rare to common)]
• Rare, severe: severe Mendelian disorders, e.g. Cystic Fibrosis, Tay-Sachs
• Common, severe: nonexistent (removed by selection)
• Rare, mild: likely many examples, but low power to detect these
• Common, mild: many common alleles with small effect sizes, e.g. high cholesterol, Crohn’s Disease, Type II Diabetes
Admixture in modern and ancient humans
Ewen Callaway Nature News 201
Neanderthal admixture and diabetes
Risk variants inherited from
Neanderthal admixture!
Linkage disequilibrium
Recombination
https://www.reddit.com/r/askscience/comments/3hq4zl/does_crossover_occur_in_all_4_nonsister/
Example history of two neighboring alleles
Present-day polymorphisms are the result of ancient mutation events:
[Figure: haplotypes shown at three stages: ancestral haplotype, after mutation, after another mutation]
Haplotype: combination of variants on the same copy of a chromosome
Recombination rearranges ancestral alleles
[Figure: haplotypes before recombination and after recombination, with the recombination event marked]
Linkage disequilibrium – present day samples
Ancestral haplotype
Present-day haplotypes
Patterns at present-day haplotypes
depend on:
• Recombination rate
• Mutation rate
• Population size
• Natural selection
The closer two markers are, the more
likely they are to share the same ancestral
haplotype
Nearby SNPs are highly correlated
Recombination induces
haplotype “blocks” of correlated
SNPs
Linkage disequilibrium decays
with distance
http://graphics.cs.wisc.edu/WP/vis10/archives/458-hapmap-linkagedisequilibrium-plot
Factors affecting LD
LD decays faster in populations
with higher heterozygosity
Recombination occurs in “hotspots”
along each chromosome
A global reference for human genetic variation. Nature 2015
Mammalian recombination hot spots: properties, control and evolution. Nature Reviews Genetics 2010
LD ends up saving us $$ - “Tag SNPs”
• LD induces tight correlation between nearby variants
• Rather than genotyping and testing each SNP for association with
disease, can focus on a few “tag” SNPs
Haplotype 1 AACACAAGCTAGCTACCTACGTAGCTACAT 30%
Haplotype 2 AACACAAGCTAGCTACCTACGTAGGTACAT 30%
Haplotype 3 AACACTAGCTAGCTACCTATGTAGGTACAT 20%
Haplotype 4 AACACTAGCAAGCTACCTACGTAGGTACAT 10%
Haplotype 5 AACACTAGCAAGCTACCTACGTAGGTACGT 10%
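To see which positions in these five haplotypes actually vary and how well they predict each other, the sketch below finds the variable sites and computes pairwise r² weighted by haplotype frequency. Positions are 0-based and function names are illustrative; choosing tag SNPs from the r² values (for example by a threshold) is left to the reader.

```python
# Haplotypes and frequencies from the slide.
haplotypes = [
    ("AACACAAGCTAGCTACCTACGTAGCTACAT", 0.30),
    ("AACACAAGCTAGCTACCTACGTAGGTACAT", 0.30),
    ("AACACTAGCTAGCTACCTATGTAGGTACAT", 0.20),
    ("AACACTAGCAAGCTACCTACGTAGGTACAT", 0.10),
    ("AACACTAGCAAGCTACCTACGTAGGTACGT", 0.10),
]

def variable_sites(haps):
    """0-based positions where the haplotypes do not all agree."""
    seqs = [seq for seq, _ in haps]
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]

def r_squared(haps, i, j):
    """r^2 between biallelic sites i and j, weighting by haplotype frequency."""
    ref_i, ref_j = haps[0][0][i], haps[0][0][j]
    p_a = sum(f for s, f in haps if s[i] == ref_i)
    p_b = sum(f for s, f in haps if s[j] == ref_j)
    p_ab = sum(f for s, f in haps if s[i] == ref_i and s[j] == ref_j)
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

sites = variable_sites(haplotypes)
print(sites)  # [5, 9, 19, 24, 28]
for i in sites:
    for j in sites:
        if i < j:
            print(i, j, round(r_squared(haplotypes, i, j), 3))
```

Only 5 of the 30 positions vary, so genotyping a subset of those and relying on LD for the rest is where the savings come from.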
LD ends up saving us $$ - Imputation
Reference haplotype panel (e.g. 1000 Genomes):
AACACAAGCTAGCTACCTACGTAGCTACAT
AACACAAGCTAGCTACCTACGTAGGTACAT
AACACTAGCTAGCTACCTATGTAGGTACAT
AACACTAGCAAGCTACCTACGTAGGTACAT
AACACTAGCAAGCTACCTACGTAGGTACGT
Your data genotyped on SNP chip:
AACACTAGC?AGCTACCTA?GTAGGTAC?T
AACACAAGC?AGCTACCTA?GTAGGTAC?T
AACACTAGC?AGCTACCTA?GTAGGTAC?T
AACACTAGC?AGCTACCTA?GTAGGTAC?T
AACACAAGC?AGCTACCTA?GTAGCTAC?T
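The idea behind imputation can be sketched with a deliberately naive rule: find the reference haplotypes that agree with every genotyped position and copy the missing bases from the most frequent one. Real imputation tools use probabilistic models rather than exact matching; the function names are illustrative, and the panel frequencies are borrowed from the tag SNP slide as an assumption.

```python
# Reference panel with assumed haplotype frequencies (from the tag SNP example).
reference_panel = [
    ("AACACAAGCTAGCTACCTACGTAGCTACAT", 0.30),
    ("AACACAAGCTAGCTACCTACGTAGGTACAT", 0.30),
    ("AACACTAGCTAGCTACCTATGTAGGTACAT", 0.20),
    ("AACACTAGCAAGCTACCTACGTAGGTACAT", 0.10),
    ("AACACTAGCAAGCTACCTACGTAGGTACGT", 0.10),
]

def compatible(observed, reference):
    """True if the reference matches every genotyped (non-'?') position."""
    return all(o == "?" or o == r for o, r in zip(observed, reference))

def impute(observed, panel):
    """Fill '?' from the most frequent compatible reference haplotype."""
    matches = [(freq, seq) for seq, freq in panel if compatible(observed, seq)]
    if not matches:
        return observed  # leave gaps unfilled if nothing matches
    _, best = max(matches)
    return "".join(b if o == "?" else o for o, b in zip(observed, best))

print(impute("AACACAAGC?AGCTACCTA?GTAGCTAC?T", reference_panel))
print(impute("AACACTAGC?AGCTACCTA?GTAGGTAC?T", reference_panel))
```

The first chip haplotype matches exactly one reference haplotype. The second is compatible with three, so the naive rule falls back on frequency, which is exactly the uncertainty a probabilistic method would report.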
Linkage disequilibrium metrics
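Two loci: A/a and B/b
Frequencies:
         B            b            Total
A        $p_{AB}$     $p_{Ab}$     $p_A$
a        $p_{aB}$     $p_{ab}$     $p_a$
Total    $p_B$        $p_b$        1.0
Linkage equilibrium:
$p_{AB} = p_A p_B$
$p_{Ab} = p_A(1-p_B)$
$p_{aB} = (1-p_A)p_B$
$p_{ab} = (1-p_A)(1-p_B)$
Linkage disequilibrium:
$p_{AB} \neq p_A p_B$
$p_{Ab} \neq p_A(1-p_B)$
$p_{aB} \neq (1-p_A)p_B$
$p_{ab} \neq (1-p_A)(1-p_B)$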
$D = p_{AB} - p_A p_B$
• Sign is arbitrary
• Range depends on allele frequencies
$D' = D / D_{max}$ (normalize by theoretical maximum)
• $D_{max} = \max(-p_A p_B,\ -(1-p_A)(1-p_B))$ if $D<0$
• $D_{max} = \min(p_A(1-p_B),\ (1-p_A)p_B)$ if $D>0$
• Range between -1 and +1
$r = D / \sqrt{p_A(1-p_A)\,p_B(1-p_B)}$
• Same as correlation coefficient. Between -1 and +1
• $r^2$ gives power loss in association studies
• Most commonly used in population genetics
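D and r follow directly from one haplotype frequency and the two allele frequencies. A minimal sketch, with an illustrative function name and made-up example frequencies:

```python
import math

def ld_stats(p_ab, p_a, p_b):
    """Return (D, r, r^2) from haplotype frequency pAB and allele frequencies."""
    d = p_ab - p_a * p_b
    r = d / math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r, r * r

print(ld_stats(0.25, 0.5, 0.5))  # linkage equilibrium: pAB = pA * pB
print(ld_stats(0.5, 0.5, 0.5))   # A always travels with B
```

The first call returns zeros throughout; the second gives D = 0.25 and r = 1, the case where one SNP is a perfect proxy for the other.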
Problem set 2
Prob. 1: Principal components ancestry analysis
Lecture #3 (01/17/17): Determining Ancestry
Prob. 2: Relative finding
Given a set of genomes, estimate all pairwise relationships
Lecture #3 (01/17/17): Determining Ancestry
Prob. 3: Imputation
Reference haplotype panel (e.g. 1000 Genomes):
AACACAAGCTAGCTACCTACGTAGCTACAT
AACACAAGCTAGCTACCTACGTAGGTACAT
AACACTAGCTAGCTACCTATGTAGGTACAT
AACACTAGCAAGCTACCTACGTAGGTACAT
AACACTAGCAAGCTACCTACGTAGGTACGT
Your data genotyped on SNP chip:
AACACTAGC?AGCTACCTA?GTAGGTAC?T
AACACAAGC?AGCTACCTA?GTAGGTAC?T
AACACTAGC?AGCTACCTA?GTAGGTAC?T
AACACTAGC?AGCTACCTA?GTAGGTAC?T
AACACAAGC?AGCTACCTA?GTAGCTAC?T
Lecture #4 (01/19/17): Phasing and Imputation
Journal club: Positive selection
Signals of selection
Signs of positive selection:
• High derived (non-ancestral) allele frequency
• High differentiation between populations (reflecting recent
local adaptations)
• Long haplotypes: new alleles have risen to high frequency
without enough time to break down haplotypes by
recombination
Tests: LRH, iHS, XP-EHH
Signs of purifying selection:
• Less variation than expected by chance. Deleterious allele
is rare.
Tests: Tajima’s D, Fu and Li Test, many others
Websites to explore specific variants
• dbSNP (https://www.ncbi.nlm.nih.gov/snp).
Most SNP identifiers (rsids) are from
dbSNP. Website documents source of data,
MAF, alleles.
• ExAC (http://exac.broadinstitute.org/).
Genetic variation from 60,000+ samples.
Provides allele frequencies by population
• SNPedia (https://snpedia.com/). Detailed
curated information about specific SNPs
Examples
• LCT - rs4988235 (lactose tolerance)
https://www.ncbi.nlm.nih.gov/snp/?term=rs4988235
• SLC24A5 - rs1426654 (light skinned, European)
http://exac.broadinstitute.org/variant/15-48426484-A-G
• EDAR, EDA2R - rs3827760 (hair
development, Asians)
http://exac.broadinstitute.org/variant/2-109513601-A-G
• HBB – rs334 (sickle cell)
http://exac.broadinstitute.org/variant/11-5248232-T-A
Bonus slides
Mutation selection balance
Mutation-selection balance: the rate at which deleterious alleles arise by mutation
is equal to the rate at which they are removed by selection
AA: fitness ($w_{11}$) = 1
AB: fitness ($w_{12}$) = 1-hs
BB: fitness ($w_{22}$) = 1-s
Mean fitness: $\bar{w} = p^2 w_{11} + 2pq\,w_{12} + q^2 w_{22}$
h: dominance effect (between 0 and 1)
s: selection coefficient
A mutates to B at rate μ
P(A) = p, P(B) = q
Frequency of allele A in next generation is controlled by both mutation and
selection:
$p' = \frac{p^2 w_{11} + pq\,w_{12}}{\bar{w}}(1-\mu)$
At equilibrium state, p’ = p. Solving gives:
$q \approx \mu/(hs)$ (partial dominance, h>0)
$q = \sqrt{\mu/s}$ (completely recessive, h=0)
Thus: high s (very deleterious) pushes the deleterious allele (B) to low frequency.
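The equilibrium results can be checked numerically by iterating the recursion for p' until it settles. This is a sketch with illustrative parameter values (μ = 1e-5, s = 0.1) and made-up function names; the number of generations is simply chosen large enough for convergence.

```python
import math

def next_p(p, mu, s, h):
    """One generation of selection followed by mutation A -> B."""
    q = 1.0 - p
    w11, w12, w22 = 1.0, 1.0 - h * s, 1.0 - s
    w_bar = p * p * w11 + 2 * p * q * w12 + q * q * w22
    return (p * p * w11 + p * q * w12) / w_bar * (1.0 - mu)

def equilibrium_q(mu, s, h, generations=200_000, q0=0.05):
    """Iterate the recursion and return the frequency of allele B."""
    p = 1.0 - q0
    for _ in range(generations):
        p = next_p(p, mu, s, h)
    return 1.0 - p

mu, s = 1e-5, 0.1   # illustrative values
q_partial = equilibrium_q(mu, s, h=0.5)
q_recessive = equilibrium_q(mu, s, h=0.0)
print(q_partial, mu / (0.5 * s))        # compare with mu/(hs)
print(q_recessive, math.sqrt(mu / s))   # compare with sqrt(mu/s)
```

With the same μ and s, the recessive allele settles at a far higher frequency, because selection only acts on it in BB homozygotes.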
Modeling genetic diversity – Coalescent Theory
Ancestral population
Most recent common ancestor (MRCA)
Descendants for each generation
are chosen at random with
replacement from current
generation
Key of coalescent theory:
analyze backwards, rather than
forward, in time.
Focus on ancestors of current
population (e.g. only yellow).
Extremely useful framework for
simulating haplotype histories
Present-day haplotypes
http://www.csbio.unc.edu/mcmillan/index.py?run=Courses.Comp790S09
Modeling genetic diversity – Coalescent Theory
Population size: 2N gene copies (N = number of diploid individuals)
Coalescence after 1 generation: $\frac{1}{2N}$
P(allele in gen. t has parent in t-1) = 1
P(another allele in gen. t has same parent) = $\frac{1}{2N}$
Coalescence after 2 generations: $\frac{1}{2N}\left(1-\frac{1}{2N}\right)$
Coalescence after t generations: $\frac{1}{2N}\left(1-\frac{1}{2N}\right)^{t-1}$
Coalescent time of two gene copies follows a geometric distribution with mean 2N
For k gene copies: expected time to the MRCA is $4N(1-1/k)$
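The geometric distribution above can be verified numerically. The sketch below sums t times the coalescence probability to recover the mean of 2N and compares it with the k-copy formula at k = 2; N = 100 and the function names are illustrative.

```python
def coalescence_prob(t, n_individuals):
    """P(two gene copies first share an ancestor exactly t generations back)."""
    c = 1.0 / (2 * n_individuals)
    return c * (1.0 - c) ** (t - 1)

def expected_tmrca(k, n_individuals):
    """Expected generations back to the MRCA of k gene copies: 4N(1 - 1/k)."""
    return 4 * n_individuals * (1.0 - 1.0 / k)

n_individuals = 100
# Truncated sum; the tail beyond 20,000 generations is negligible for N = 100.
mean_pair = sum(t * coalescence_prob(t, n_individuals) for t in range(1, 20_000))
print(mean_pair)                          # approximately 2N = 200
print(expected_tmrca(2, n_individuals))   # 4N(1 - 1/2) = 2N = 200
```

As k grows, the expected time approaches 4N, so adding more samples extends the tree only slightly.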
Coalescence with mutation
Mutations happen along the branches of the tree at some rate μ per generation. For a pair of sequences:
[Figure: Sample 1 and Sample 2 each descend from the MRCA through t generations]
# mutations between sample 1 and sample 2 = 2tLμ
where L is the length of the gene sequence
Applications of coalescent with mutation:
• Infer population genetics parameters e.g.
mutation rate, from sequence
• Infer effective population size
• Test for selection by comparing observed
vs. expected variation
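As a rough illustration of the second application, the expected number of differences 2tLμ can be combined with the mean pairwise coalescence time of 2N to back out a population size from observed diversity. This is a sketch under the idealized model only; the function names are made up, and plugging in the mutation rate and the 1/1,000 diversity figure from earlier in the lecture is an illustrative assumption.

```python
def expected_pairwise_differences(t, length, mu):
    """Expected mutations separating two samples whose MRCA is t generations back."""
    return 2 * t * length * mu

def estimate_population_size(diffs_per_bp, mu):
    """Solve diffs_per_bp = 2 * E[t] * mu for N, using E[t] = 2N."""
    return diffs_per_bp / (4 * mu)

# 1 kb sequence, MRCA 20,000 generations back (hypothetical values).
print(expected_pairwise_differences(t=20_000, length=1_000, mu=1.5e-8))

# Roughly 1 difference per 1,000 bp between unrelated humans.
print(estimate_population_size(diffs_per_bp=1e-3, mu=1.5e-8))
```

The resulting N is an effective size in the tens of thousands, far below the census population, which is consistent with the point that effective and actual population sizes differ.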
More resources on coalescent theory
Textbooks
• Principles of Population Genetics. Daniel L Hartl & Andrew G. Clark
• Evolutionary Theory. Sean H. Rice
Simulation tools
• fastsimcoal
• simcoal2
• ms
• msms
• Many others…