Otto Warburg International Summer School and Research Symposium 2013
Genetic Variation and Natural Selection Detection
Shuhua Xu
CAS-MPG Partner Institute for
Computational Biology (PICB)
Genetic variation
• Genetic variation is variation in the alleles of genes; it
occurs both within and among populations.
– Mutation
– Polymorphism
Polymorphism
• Polymorphism is often defined as the presence
of more than one genetically distinct type in a
single population.
• Rare variations are not classified as
polymorphisms; and mutations by themselves do
not constitute polymorphisms.
DNA polymorphism
• RFLP (Restriction Fragment Length Polymorphism)
• AFLP (Amplified Fragment Length Polymorphism)
• RAPD (Random Amplification of Polymorphic DNA)
• VNTR (Variable Number Tandem Repeat, or Minisatellite)
• STR (Short Tandem Repeat, or Microsatellite)
• SNP (Single Nucleotide Polymorphism)
• SFP (Single Feature Polymorphism)
• CNV (Copy Number Variation)
Information from NGS
The 1000 Genomes Project
• Full sequence data
• Polymorphisms
• Rare mutations
• CNVs
• Small indels
• Recombination
Intuitive statistics
• Number of alleles
– More alleles, larger diversity;
• Minor allele frequency (MAF)
– is the frequency of the less (or least) frequent
allele in a given locus and a given population.
Mutation: MAF ≤ 1%
Polymorphism: MAF > 1%
Heterozygosity
• The fraction of individuals in a population that
are heterozygous for a particular locus.
• It can also refer to the fraction of loci within
an individual that are heterozygous.
Observed
\[ H_o = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(a_{i1} \neq a_{i2}) \]
where n is the number of individuals in the population, and ai1, ai2
are the alleles of individual i at the target locus.
Expected
\[ H_e = 1 - \sum_{i=1}^{m} f_i^2 \]
where m is the number of alleles at the target locus, and fi is the
allele frequency of the ith allele at the target locus.
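The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy genotypes are assumptions made here, and the expected heterozygosity is computed as one minus the sum of squared allele frequencies, without any small-sample correction.

```python
from collections import Counter

def observed_heterozygosity(genotypes):
    """Fraction of individuals whose two alleles differ at the locus."""
    return sum(a1 != a2 for a1, a2 in genotypes) / len(genotypes)

def expected_heterozygosity(genotypes):
    """One minus the sum of squared allele frequencies (no sample-size correction)."""
    counts = Counter(allele for pair in genotypes for allele in pair)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

# Hypothetical genotypes of four diploid individuals at one biallelic locus.
genotypes = [("A", "A"), ("A", "G"), ("G", "G"), ("A", "G")]
h_obs = observed_heterozygosity(genotypes)
h_exp = expected_heterozygosity(genotypes)
```

A large gap between `h_obs` and `h_exp` is the kind of signal the comparison of Ho and He looks for.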
Heterozygosity related issues
• Heterozygosity and HWD
– Comparison of Ho and He
• Gene diversity
Population Mutation Rate (θ)
• Under mutation-drift equilibrium:
– θ = 4Neμ for autosome
– θ = Neμ for Y and mtDNA
– θ = 3Neμ for X chromosome
θautosome > θX > θY
Estimators of θ
• Number of segregating sites (θK);
• Average pairwise differences (θ∏);
• Number of alleles (θE);
• Mean number of mutations since the MRCA (θΩ);
• Singleton
• ……
Number of segregating sites (K)
► Under the infinite site model, K is equal to
the number of mutations since the most
recent common ancestor of the sequences
in the sample.
► Therefore, K has a clear biological
meaning.
► However, K depends on the sample size.
Normalized K
\[ \theta_K = \frac{K}{a_n}, \qquad a_n = 1 + \frac{1}{2} + \cdots + \frac{1}{n-1} \]
Variance of θK
► Under the neutral Wright-Fisher model with
constant effective population size,
\[ E(\theta_K) = \theta, \qquad \mathrm{Var}(\theta_K) = \frac{\theta}{a_n} + \frac{b_n \theta^2}{a_n^2} \]
\[ b_n = 1 + \frac{1}{4} + \cdots + \frac{1}{(n-1)^2} \]
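As a rough sketch of how the normalized estimator and its variance might be evaluated, the Python below computes the harmonic sums a_n and b_n and plugs them into θK = K/a_n and Var(θK) = θ/a_n + b_n θ²/a_n². The function names are illustrative assumptions, not part of the lecture.

```python
def harmonic_sums(n):
    """Return (a_n, b_n) for a sample of n sequences."""
    a_n = sum(1.0 / i for i in range(1, n))
    b_n = sum(1.0 / i ** 2 for i in range(1, n))
    return a_n, b_n

def theta_k(num_segregating_sites, n):
    """Normalized number of segregating sites, K / a_n."""
    a_n, _ = harmonic_sums(n)
    return num_segregating_sites / a_n

def var_theta_k(theta, n):
    """Variance of theta_K under the neutral model, for a given true theta."""
    a_n, b_n = harmonic_sums(n)
    return theta / a_n + b_n * theta ** 2 / a_n ** 2

# Hypothetical sample: 10 sequences with 16 segregating sites.
estimate = theta_k(16, 10)
variance = var_theta_k(estimate, 10)
```

In practice the unknown θ in the variance is replaced by an estimate, as done in the last line.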
Wright-Fisher model
• N diploid individuals. Generations are non-overlapping. At each
generation, each chromosome inherits its genetic material from a
uniformly chosen chromosome from the previous generation,
independently from all other chromosomes.
• In its most basic form, the Wright-Fisher model overlooks many
important details:
– 1. Mutation
– 2. Recombination
– 3. Sexes
– 4. Non-overlapping generations
– 5. Population size changes
– 6. Family size distribution
– 7. Population structure
– 8. Selection
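The basic model described above (each chromosome copying a uniformly chosen chromosome from the previous generation) can be sketched as a small simulation. This is a minimal illustration of pure drift for one biallelic site; the function name, parameters and seed are assumptions introduced here, and none of the listed complications (mutation, recombination, selection and so on) are modeled.

```python
import random

def wright_fisher(num_diploid, initial_freq, generations, seed=0):
    """Track the frequency of one allele among 2N chromosomes under pure drift."""
    rng = random.Random(seed)
    num_chrom = 2 * num_diploid
    count = round(initial_freq * num_chrom)
    trajectory = [count / num_chrom]
    for _ in range(generations):
        freq = count / num_chrom
        # Each chromosome copies a uniformly chosen parent chromosome, so the
        # new allele count is a binomial draw with success probability freq.
        count = sum(rng.random() < freq for _ in range(num_chrom))
        trajectory.append(count / num_chrom)
    return trajectory

# Hypothetical run: N = 50 diploid individuals, starting frequency 0.5.
traj = wright_fisher(num_diploid=50, initial_freq=0.5, generations=100)
```

Running it with different seeds shows how frequencies wander and eventually fix or are lost, even without selection.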
The properties of θK
• θK is independent of sample size.
• However, the usefulness of θK is not clear
under other population genetic models,
such as those with natural selection.
• θK is sensitive to the number of rare alleles,
or mutants of low frequency.
How many common SNPs in human genome?
• Common SNPs: minor allele frequency (MAF) > 0.05;
• Suppose we have 50 samples of African, European, Asian
respectively;
• Theta = 1.2/kb for African population;
• Theta = 0.8/kb for European and Asian population;
• Autosome length (L) = 2.68 billion bp;
\[ E(S_{MAF>5\%}) = L\,\theta_K \sum_{i=6}^{104} \frac{1}{i} \]
where
\[ \theta_K = \frac{\sum_{i=1}^{n-1} \xi_i}{\sum_{i=1}^{n-1} \frac{1}{i}} \]
► We expect 9.8 million common SNPs in 50 African samples (θK = 1.2/kb);
► We expect 6.5 million common SNPs in 50 European samples (θK = 0.8/kb);
► We expect 6.5 million common SNPs in 50 Asian samples (θK = 0.8/kb).
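The arithmetic behind these figures can be sketched as below. The summation limits are an assumption made for this sketch, not taken from the slide: 50 diploid samples are treated as 100 chromosomes, and a site counts as common when its allele count lies between 5 and 95, using the neutral expectation θ/i for sites with allele count i. With those assumptions the results land close to the quoted values.

```python
def expected_common_snps(theta_per_bp, length_bp, n_chrom, min_count):
    """Expected number of sites with allele count in [min_count, n_chrom - min_count]."""
    harmonic = sum(1.0 / i for i in range(min_count, n_chrom - min_count + 1))
    return length_bp * theta_per_bp * harmonic

L_AUTOSOME = 2.68e9  # autosome length in bp, as given on the slide

# theta is given per kb on the slide, so divide by 1000 to get per bp.
african = expected_common_snps(1.2e-3, L_AUTOSOME, n_chrom=100, min_count=5)
non_african = expected_common_snps(0.8e-3, L_AUTOSOME, n_chrom=100, min_count=5)
```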
Average pairwise differences (∏)
• Also known as
– sequence diversity
– mean number of nucleotide differences
between two sequences.
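\[ \Pi = \frac{2}{n(n-1)} \sum_{i<j} d_{ij} \]
where dij is the number of nucleotide differences between sequences i and j.
\[ E(\Pi) = \theta, \qquad \mathrm{Var}(\Pi) = \frac{n+1}{3(n-1)}\,\theta + \frac{2(n^2+n+3)}{9n(n-1)}\,\theta^2 \]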
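A minimal sketch of the average pairwise difference for a set of aligned sequences follows. The function name and the toy alignment are assumptions for illustration; gaps and missing data are not handled.

```python
from itertools import combinations

def pairwise_diversity(sequences):
    """Mean number of nucleotide differences between two sequences in the sample."""
    n = len(sequences)
    total = sum(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in combinations(sequences, 2)
    )
    return 2.0 * total / (n * (n - 1))

# Hypothetical alignment of four sequences of equal length.
seqs = ["ACGTAC", "ACGTTC", "ACCTAC", "ACGTAC"]
pi_hat = pairwise_diversity(seqs)
```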
The properties of ∏
• ∏ as a measure of genetic variation has clear biological
meanings which do not depend on the underlying
evolutionary process.
• In comparison to θK, it is insensitive to rare alleles,
or mutants of low frequency.
• ∏ is a useful measure of persistent genetic variation,
and of neutral genetic variation when purifying selection is
operating.
• However, because its variance is considerably larger
than that of θK, it is not as good as θK for a neutral locus.
Number of alleles
• Ewens (1972) shows that under the infinite allele model
\[ E(k) = 1 + \frac{\theta}{\theta+1} + \cdots + \frac{\theta}{\theta+n-1} \]
• An estimate of θ can be obtained by solving the above
equation for θ with E(k) replaced by k.
• The estimate is known as Ewens’s estimator θE.
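Solving the equation for θ has no closed form, so a numerical root finder is one way to do it. The sketch below uses plain bisection, relying on the fact that E(k) increases with θ; the function names and the tolerance are assumptions for illustration.

```python
def expected_num_alleles(theta, n):
    """E(k) = sum over i = 0..n-1 of theta / (theta + i)."""
    return sum(theta / (theta + i) for i in range(n))

def ewens_theta(k, n, tol=1e-10):
    """Solve E(k | theta, n) = k for theta by bisection (needs 1 < k < n)."""
    if not 1 < k < n:
        raise ValueError("need 1 < k < n for a finite positive estimate")
    lo, hi = 1e-12, 1.0
    while expected_num_alleles(hi, n) < k:  # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_num_alleles(mid, n) < k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical sample: 5 distinct alleles observed among 20 sequences.
theta_e = ewens_theta(5, 20)
```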
The properties of θE
• Under the infinite allele model, θE is about the
best estimator one can devise.
• However, θE is a slightly upward-biased estimator,
particularly when θ is large.
Mean number of mutations since the MRCA (Ω)
• The mean number Ω of mutations since the
most recent common ancestor (MRCA) of a
sample is another intuitive summary statistic, but
seldom used in practice.
• This is probably partly because its use
requires knowing the ancestral nucleotide for
each segregating site, and partly because its
statistical properties are not well
understood.
Mean number of mutations since the MRCA (Ω)
• Let ωl be the number of mutations in sequence l
since MRCA.
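• Then the average is given by
\[ \Omega = \frac{1}{n} \sum_{l=1}^{n} \omega_l \]
• Note that a mutation of size i is counted as one
mutation in i of n sequences; we therefore have
\[ \Omega = \frac{1}{n} \sum_{i=1}^{n-1} i\,\xi_i \]
where ξi is the number of mutations of size i.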
Mean number of mutations since the MRCA (Ω)
• It follows that
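The step announced here can be sketched as follows, assuming the standard neutral expectation E(ξi) = θ/i for the number of mutations of size i (an assumption made for this sketch; it is not stated in the transcript) and writing Ω as the size-weighted average of the ξi:

\[ E(\Omega) = \frac{1}{n}\sum_{i=1}^{n-1} i\,E(\xi_i) = \frac{1}{n}\sum_{i=1}^{n-1} i \cdot \frac{\theta}{i} = \frac{n-1}{n}\,\theta, \qquad \text{so} \qquad \theta_\Omega = \frac{n}{n-1}\,\Omega . \]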
Singleton mutations
• The number ξ1 of mutations of size 1 in a sample
is of special interest because it captures mostly
the recent mutations in a sample.
• According to Fu and Li (1993),
Classify the above summary statistics
• ∏0,0 = θK
• ∏1,1 = θ∏
Weight of ∏k,l statistics
Natural selection
• Individuals with favorable traits are more
likely to leave more offspring better suited for
their environment.
• Also known as “Differential Reproduction”
• Also called “Survival of the fittest”
Artificial Selection
• The selective breeding of domesticated plants
and animals by man.
Stabilizing Selection
• Favors the intermediate variants
• Selects against the extreme phenotypes
• e.g. Human birth weight
[Figure: number of individuals against range of values at times 1, 2 and 3]
Disruptive Selection
• Favors variants of opposite extremes
• e.g. London's peppered moths
[Figure: number of individuals against range of values at times 1, 2 and 3]
Directional Selection
• Favors one extreme phenotype or other extreme
• e.g. beak length of the Galapagos finches
[Figure: number of individuals against range of values at times 1, 2 and 3]
Darwin’s 5 points
1. Population has variations.
2. Some variations are favorable.
3. More offspring are produced than
survive
4. Those that survive have favorable traits.
5. A population will change over time.
Balancing selection
• Balancing selection refers to a number of
selective processes by which multiple alleles
are actively maintained in the gene pool of a
population at frequencies above that of gene
mutation.
• heterozygote advantage
• frequency-dependent selection
Negative selection and positive selection
• Negative selection or purifying selection is the
selective removal of alleles that are
deleterious.
• Positive selection is selection in favor of a particular
trait, which increases the frequency of the underlying
allele in a population.
Footprints of natural selection in genomes
• Loss of genetic diversity
• Skewed allele frequency spectrum
• Unexpected substitution ratio
• Extended haplotype homozygosity
• Elevated linkage disequilibrium
Gene Trees and Evolutionary Hypotheses
[Figure: gene trees under three scenarios: Neutral, Balancing Selection, Selective Sweep]
Neutralist vs. selectionist view
Are most substitutions due to drift or natural selection?
“Neutralist” vs. “selectionist”
Agree that:
Most mutations are deleterious and are removed.
Some mutations are favourable and are fixed.
Dispute:
Are most replacement mutations that fix beneficial or neutral?
Is observed polymorphism due to selection or drift?
Neutral hypothesis as the null model
• Whether a locus has been evolving under natural
selection is often of interest if the locus represents a gene
or is linked to one.
• As is typical in many branches of science, a simpler
explanation of a phenomenon is often preferred unless
there is strong evidence to suggest otherwise.
• In population genetics, the neutral hypothesis of
evolution is arguably simpler than any other hypothesis
and is much better understood statistically.
• As a result, it is now generally used as the null model for
analyzing polymorphism.
• A significant deviation from the null model may signal the
presence of forces that are absent or factors that are
over-simplified in the null model.
Statistical tests using estimators of θ
• There are several ways statistical tests can be
constructed to see if the null model is adequate for
explaining the observed amount and pattern of
polymorphism.
• Many summary statistics (estimators of θ) have quite
different expectations when the null model is violated;
this offers an opportunity for testing by considering
the difference between two measures of
polymorphism.
Statistical tests using estimators of θ
Suppose L1 and L2 are two different summary statistics
such that E(L1) = E(L2) under the hypothesis of strict
neutrality.
Then one way to test the null hypothesis of strict
neutrality is to use the normalized difference
\[ \frac{L_1 - L_2}{\sqrt{\mathrm{Var}(L_1 - L_2)}} \]
as test statistic.
Normalization is intended to minimize the effect of unknown
parameter(s) so that the resulting test is more rigorous.
Note that Var(L1−L2) is a function of θ so its value needs
to be estimated.
Statistical tests using estimators of θ
• Although every pair of statistics L1 and L2 can be used to
construct a test as long as E(L1) = E(L2) and Var(L1−L2)
can be computed, such a test is useful only if the values
of L1 and L2 are likely to differ when the locus under
study departs from neutrality.
• Unfortunately the distribution of a test of the form above
is not well approximated by any standard distribution, so
critical values are commonly obtained from a large number of
simulated samples, which means that
the best way to apply such tests is to use a computer
package that implements the test.
• Therefore, we will focus on discussing the rationale of
several tests rather than the details of their computation.
Tajima test
\[ D = \frac{\Pi - K/a_n}{\sqrt{\mathrm{Var}(\Pi - K/a_n)}} \]
• The parameter θ required for computing the
variance is estimated by K/an.
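For readers who want to see the statistic computed, here is a sketch that uses the variance estimator published by Tajima (1989), with the constants a1, a2, b1, b2, c1, c2, e1, e2 from that paper. The function name and the example numbers are assumptions for illustration; significance would still have to be judged against simulated critical values.

```python
import math

def tajimas_d(n, num_segregating_sites, pi_hat):
    """Tajima's D from sample size n, segregating sites K and mean pairwise differences."""
    k = num_segregating_sites
    if n < 4 or k == 0:
        # For n < 4 the estimated variance is zero, so D is undefined.
        raise ValueError("need n >= 4 and at least one segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * k + e2 * k * (k - 1)  # estimated Var(pi - K / a_n)
    return (pi_hat - k / a1) / math.sqrt(variance)

# Hypothetical sample: 10 sequences, 16 segregating sites, mean pairwise difference 3.89.
d_value = tajimas_d(n=10, num_segregating_sites=16, pi_hat=3.888889)
```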
Rationale of Tajima test
• Since K ignores the frequency of mutants, it is
strongly affected by the existence of deleterious
alleles, which are usually kept at low frequencies.
• In contrast, ∏ is not much affected by the
existence of deleterious alleles because it takes
the frequency of mutants into consideration.
• Therefore, a D value that is significantly different
from 0 suggests that the null hypothesis should
be rejected.
Indication of Tajima’s D
• When a population has undergone selective
sweeps (or population growth), K/an will likely
be larger than ∏, resulting in a negative value of D.
• When a population has been under balancing
selection (or has population structure with sampling
from many populations), K/an will likely be
smaller than ∏, resulting in a positive value of D.
Tajima’s D Expectations
• Neutrality: D=0
• Balancing Selection: D>0
– Divergence of alleles (π) increases
• Purifying or Positive Selection: D<0
– Divergence of alleles decreases
• Also
– Bottleneck, D>0 (S decreases)
– Population expansion: D<0 (Divergence of alleles
decreases: many low frequency alleles)
Tajima’s D Expectations
[Figure: genealogies under three scenarios. Balancing selection: pairwise differences (k) increase faster than S, D positive. Neutral: D = 0. Selective sweep: many low frequency variants and singletons, D negative.]
Distribution of Tajima’s D
[Figure: distribution of Tajima’s D with θ = 5 and n = 100]
Fu and Li test
where the parameter required for computing variances is also estimated by K/an.
Rationale of Fu and Li test
• Test D is preferred over D* whenever the size of
singleton can be resolved, for example, by using an
outgroup sequence or by the help of phylogeny
reconstruction.
• The reasons for focusing on external or singleton
mutations are as follows.
– In the presence of natural selection, deleterious mutations are likely to
be eliminated from a population quickly or present in low frequencies.
– In other words, deleterious mutations are usually recent mutations
and they are most likely to be found in the external branches of the
sample genealogy, i.e., they are most likely external mutations or
singletons.
– In contrast, mutations found in the internal branches are not as young,
they are more likely to be neutral and their frequency is less affected
by the presence of selection.
– Therefore, contrast between external and internal mutations, or
contrast between singletons and non-singletons can be used to detect
the presence of natural selection.
Indication of Fu and Li D
• Negative values of D and D* indicate an excess
of recent mutations or rare alleles (positive
selection and/or population expansion).
• Positive values indicate an excess of common
alleles (balancing selection and/or population
structure).
Distribution of Fu and Li D and D*
[Figure: distribution of Fu and Li D and D* with θ = 5 and n = 100]
Fay and Wu test
• Fay and Wu (2000) proposed a test which in
our notation is
• ∏0,0 = θK
• ∏1,1 = θ∏
• ∏1,0 = θΩ
Distribution of H’
KS and KA
• For proteins, two major categories of changes
are synonymous (KS) and non-synonymous (KA)
• The likelihood of synonymous vs. nonsynonymous change depends upon the
nucleotide codon position (first, second or third)
Genetic Codon and Codon Degeneracy
• Codon degeneracy
• A change in the first position nucleotide
almost always causes a non-synonymous change
• A change in the second nucleotide
always causes a non-synonymous change
• The third position is more complicated
(being often two-fold degenerate)
• Also as adjacent nucleotide sites change,
these probabilities change
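The position-by-position claims above can be checked by brute force against the standard genetic code. The sketch below is an illustration under stated assumptions: it uses the standard code only, skips stop codons as starting points, and treats a change to a stop codon as non-synonymous. All names are introduced here for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def synonymous_counts():
    """Count synonymous single-base changes per codon position (sense codons only)."""
    counts = [0, 0, 0]
    totals = [0, 0, 0]
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                totals[pos] += 1
                counts[pos] += CODON_TABLE[mutant] == aa
    return counts, totals

counts, totals = synonymous_counts()
```

Under these assumptions only a handful of first-position changes are synonymous, none of the second-position changes are, and most third-position changes are.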
KS, KA
• To get at these rates, we need to classify nucleotide sites as
synonymous vs. non-synonymous
• KS = The number of synonymous changes divided by
the number of synonymous sites
• KA = The number of non-synonymous changes divided
by the number of non-synonymous sites
KA/KS test
► dN/dS = KA/KS = 1: Neutral theory prediction if a non-syn.
substitution is neutral.
► dN/dS = KA/KS < 1: Neutral theory prediction if a non-syn.
substitution is under purifying selection.
► dN/dS = KA/KS > 1: Selection theory prediction if a non-syn.
substitution is under positive selection.
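A bare-bones sketch of the ratio and its reading is given below. It takes the counts of changes and sites as already known, applies no correction for multiple hits, and uses an arbitrary tolerance around 1 that is purely an assumption for illustration; a real analysis would use a proper statistical test.

```python
def ka_ks(nonsyn_changes, nonsyn_sites, syn_changes, syn_sites):
    """Return (KA, KS, KA/KS) from raw counts of changes and sites."""
    ka = nonsyn_changes / nonsyn_sites
    ks = syn_changes / syn_sites
    return ka, ks, ka / ks

def interpret(ratio, tolerance=0.05):
    """Crude reading of KA/KS; the tolerance is a hypothetical choice."""
    if abs(ratio - 1.0) <= tolerance:
        return "consistent with neutrality"
    return "purifying selection" if ratio < 1.0 else "positive selection"

# Hypothetical gene: 5 non-synonymous changes over 700 sites, 20 synonymous over 300.
ka, ks, ratio = ka_ks(5, 700, 20, 300)
reading = interpret(ratio)
```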
McDonald-Kreitman Test
► Tracks synonymous versus nonsynonymous
substitutions
• Fixed between species
► Non-synonymous: NF
► Synonymous: SF
• Polymorphic within species (pairwise comparisons)
► Non-synonymous: NP
► Synonymous: SP
► NF/SF = NP/SP under neutrality
► More sensitive to detection of
positive selection
McDonald-Kreitman logic
►Silent sites
- always neutral
- fix slowly
- contribute to polymorphism
►Replacement sites
– mainly unfavourable
– if neutral, fix at same rate as silent and contribute to
polymorphism
– proportion of replacement mutations that are neutral
determines dN / dS for polymorphism
– if favourable, fix quickly and do not contribute to
polymorphism: higher dN / dS for fixed differences, lower
rate for polymorphism
Time to fixation: favorable and neutral
McDonald-Kreitman hypotheses
H0: All mutations are neutral.
Then, dN / dS for polymorphic sites should equal
dN / dS for fixed differences
H1: replacements are favoured. Favoured
mutations fix rapidly, so dN / dS for polymorphic
< dN / dS fixed
McDonald-Kreitman test
‘coding’
‘non-coding’
Example of MK test: ADH in Drosophila
Compare sequences of D. simulans and D. yakuba
for ADH (alcohol dehydrogenase)

                 Fixed differences   Polymorphic sites
Replacement      7                   2
Silent           17                  42
% replacement    7 / 24 = 29%        2 / 44 = 5%

Significance? Use χ2 test for independence
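The suggested test of independence can be sketched for the 2 x 2 table above. This version is a plain Pearson chi-square with one degree of freedom and no continuity correction, with the p-value obtained from the complementary error function; the function name is an assumption. Because one expected count is small, an exact test may be preferable in practice.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Rows: replacement, silent. Columns: fixed differences, polymorphic sites.
chi2, p_value = chi_square_2x2(7, 2, 17, 42)
```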
Neutral polymorphism and divergence
• Polymorphism: θ = 4Nμ (higher when μ and N are high, lower when μ and N are low)
• Divergence: D = 2μt (higher when μ is high, lower when μ is low), with t in generations
• Ratio D/θ = 2μt / 4Nμ = t / 2N
The HKA test (Hudson-Kreitman-Aguadé)
• Compares the level of polymorphism within species
with the level of divergence between species
– Expected level of polymorphism is estimated from the
level of divergence
– Ratio of polymorphism to divergence should be the same
for all neutral loci and is set by the mutation rate for a
locus
– Level of neutral divergence should be unaffected by
occasional selective sweeps
The HKA test (Hudson-Kreitman-Aguadé)
In the HKA test, the levels of polymorphism and
divergence in two or more loci are considered:

                 Locus 1   Locus 2
Polymorphisms    P1        P2
Substitutions    S1        S2

\[ X^2 = \sum_{i=1}^{k} \frac{\left(P_i - \hat{E}(P_i)\right)^2}{\hat{V}(P_i)} + \sum_{i=1}^{k} \frac{\left(S_i - \hat{E}(S_i)\right)^2}{\hat{V}(S_i)} \]
where k is the number of loci.
The HKA test (Hudson-Kreitman-Aguadé)
A Prediction of the Neutral Theory of Molecular Evolution
And Departures from Neutrality
[Figure: variation within species at locus X plotted against divergence between species at locus X. A neutral zone marks the neutral prediction; departures are labeled Frequency-Dependent Selection and Balancing Selection on one side, Adaptive Divergence and Selective Sweep on the other.]

           Fixed   Poly
Locus 1    50      5
Locus 2    30      3
Population Genomics
Population Genetics
Local adaptation (positive selection)
Functional restriction (negative selection)
Disease (negative selection)
Population genomics approach
• Easy to distinguish the selection signals from demographic events.
• No prior knowledge of the gene function is needed.
[Figure: a typical population genomics study design for detecting positive selection. Joshua M. Akey, Genome Res. 2009 19: 711-722]
Finding selective sweeps in
genomic (NGS) data
Problem
We do not fully know the shape of the neutral
distribution and how it is affected by other factors
such as demographic history.
However, the best we can do is:
• use a statistic based on simulations
• apply it to empirical genome-wide data sets
• identify the loci in the extreme tail, which are the
most likely candidates of selection
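The outlier approach described above can be sketched as a simple ranking: score every locus or window genome-wide, then keep the most extreme fraction as candidates. The function name, the 1% cutoff and the toy scores are assumptions for illustration; which tail is of interest depends on the statistic used.

```python
def extreme_tail(scores, tail_fraction=0.01):
    """Return (locus, score) pairs in the upper tail of the genome-wide distribution."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    n_keep = max(1, int(len(ranked) * tail_fraction))
    return ranked[:n_keep]

# Hypothetical genome-wide scores for 1000 windows plus one strong outlier.
scores = {f"window_{i}": float(i % 50) for i in range(1000)}
scores["window_sweep"] = 120.0
candidates = extreme_tail(scores, tail_fraction=0.01)
```

Loci returned this way are candidates only; being in the tail does not by itself separate selection from demography.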
Test based on the relationship between
allele frequency and extent of linkage disequilibrium
No Selection
Young alleles:
• low frequency
• long-range LD (long haplotypes)
Old alleles:
• low or high frequency
• short-range LD
Positive Selection
Young alleles:
• high frequency
• long-range LD
[Figure: the signal of selection. Linkage disequilibrium (homozygosity) plotted against allele frequency, under positive selection and under neutrality.]
Long-range multi-SNP haplotypes
[Figure: core markers within a gene and long-range markers at increasing distance from it, each a biallelic SNP (C/T or A/G), with the haplotypes observed across them]
Decay of homozygosity (probability, at any distance,
that any two haplotypes that start out the same have all the
same SNP genotypes): 100%, 75%, 35%, 18%
Slide by: David Reich, Broad Institute
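The decay of homozygosity can be sketched as an extended haplotype homozygosity (EHH) calculation: among haplotypes carrying a given core allele, take the fraction of pairs that remain identical out to each marker. The function name, the one-sided extension and the toy haplotypes are assumptions for illustration.

```python
from collections import Counter

def ehh(haplotypes, core_index, core_allele, distance):
    """EHH among haplotypes carrying core_allele, extended `distance` markers to the right."""
    carriers = [h for h in haplotypes if h[core_index] == core_allele]
    if len(carriers) < 2:
        return 0.0
    segments = Counter(h[core_index:core_index + distance + 1] for h in carriers)
    same_pairs = sum(c * (c - 1) // 2 for c in segments.values())
    all_pairs = len(carriers) * (len(carriers) - 1) // 2
    return same_pairs / all_pairs

# Hypothetical haplotypes; the core SNP is the first marker.
haps = ["CAGT", "CAGT", "CAGC", "CATC", "TGTC", "TATC"]
decay = [ehh(haps, 0, "C", d) for d in range(4)]
```

Here `decay` starts at 1 at the core and falls as recombinant or mutated haplotypes break the shared segment.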
iHS: Measures the extent of haplotypes along
alleles at a given SNP
[Figure: EHH against genetic distance for the derived allele and the ancestral allele; the value 0.05 is marked on the EHH axis]
iHHA : iHH with respect to Ancestral core allele.
iHHD : iHH with respect to Derived core allele.
iHS Score
• Useful for variants that have not yet reached
fixation.
• Large negative iHS: derived allele has swept
up in frequency
• Large positive iHS: an ancestral allele
hitchhikes with the selected site.
• Hence, both cases are considered interesting!
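A sketch of how iHH and an unstandardized iHS might be computed from EHH curves follows. It assumes the common convention iHS = ln(iHHA / iHHD), which agrees with the signs described above, integrates on one side of the core only, and stops where EHH falls below 0.05; the standardization against SNPs of similar frequency that a full iHS requires is omitted. All names and numbers are illustrative.

```python
import math

def integrated_ehh(positions, ehh_values, cutoff=0.05):
    """Trapezoid-rule area under the EHH curve, stopping once EHH drops below cutoff."""
    area = 0.0
    for i in range(1, len(positions)):
        if ehh_values[i] < cutoff:
            break
        width = positions[i] - positions[i - 1]
        area += 0.5 * (ehh_values[i] + ehh_values[i - 1]) * width
    return area

def unstandardized_ihs(ihh_ancestral, ihh_derived):
    """ln(iHH_A / iHH_D); negative when the derived haplotype is unusually long."""
    return math.log(ihh_ancestral / ihh_derived)

# Hypothetical EHH curves at four genetic distances from the core SNP.
pos = [0.0, 0.1, 0.2, 0.3]
ihh_a = integrated_ehh(pos, [1.0, 0.4, 0.1, 0.02])
ihh_d = integrated_ehh(pos, [1.0, 0.9, 0.7, 0.5])
ihs = unstandardized_ihs(ihh_a, ihh_d)
```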
Summary
• Footprints of natural selection could be detected by
examining allele frequency spectrum and LD pattern
Exome sequencing data:
• KA/KS
• HKA
• MK
Whole genome sequencing data:
• Tajima’s D
• LD-based or EHH-based approaches