Genome evolution:
a sequence-centric approach
Lecture 8-9: Concepts in
population genetics
(Probability, Calculus/Matrix theory, some graph theory, some statistics)
Course map:
Probabilistic models: simple tree models, HMMs and variants, PhyloHMM, DBN, context-aware MM, factor graphs
Inference: DP, sampling, variational approximation, LBP
Parameter estimation: EM, generalized EM (optimize free energy)
Topics: genome structure, mutations, population, inferring selection
Today's refs: Hartl and Clark, topics from Chapters 3-7
See Graur and Li, chapter 2 (easy-to-read overview) and Lynch, chapter 4 (more advanced)
Tree of life
Genome Size
Elements of genome structure
Elements of genomic information
Studying Populations
Models:
A set of individuals, genomes
Ancestry relations or hierarchies
Experiments:
Field studies, diversity/genotyping
Experimental evolution
[Figures: mtDNA human migration patterns; Åland Islands, Glanville fritillary population]
Species and populations
What is a species?
Multiple definitions; most of them rely on free flow of genetic information within the species
and weak flow of information across its boundary (outward or inward)
[Figure: Species 1, Species 2]
Species can emerge through the formation of reproductive barriers
Allopatric speciation – occurs through geographical separation
Parapatric speciation – occurs without geographical separation but with weak
flow of genetic information
Sympatric speciation – occurs while information is flowing - controversial
Barriers can be genetic, physical, behavioral
Population dynamics
We think of a species genome as representing the population “average” genomic information
Individuals have genomes that are closely related to the “species genome”, but differ from it at certain loci (alleles)
As the population evolves there are continuous changes in allele frequencies, which may ultimately result in changes to the species genome (fixation)
In haploid populations (bacteria), genotypes are determined by one haplotype and ancestral relations are simple trees
In diploid populations things are a bit more complex, as genotypes can be homozygous or heterozygous at each locus.
We can measure and quantify just a few aspects of these evolutionary dynamics:
Size of populations
Allele frequencies
The average homozygosity/heterozygosity of an allele
How many alleles exist at a locus
Population genetics deals with theories that predict the behavior of these quantities using simple assumptions about the evolutionary dynamics
Frequency estimates
We will be dealing with estimation of allele frequencies.
To remind you, when sampling n times from a population with an allele of frequency p, the number of times we observe the allele is distributed as a binomial variable. This can be further approximated using a normal distribution:
$B(p; n) \approx N(np,\ np(1-p))$
When estimating the frequency from the number of successes, we therefore have a standard error of:
$s = \sqrt{\hat{p}(1-\hat{p})/n}$
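As a minimal sketch of the estimate and its standard error, the snippet below applies the formula above to a made-up sample. The function name and the sample counts are illustrative assumptions, not part of the lecture.

```python
import math

def allele_frequency_estimate(allele_count, n):
    """Return (p_hat, standard_error) for allele_count hits among n sampled alleles."""
    p_hat = allele_count / n
    # Normal approximation to the binomial: s = sqrt(p_hat * (1 - p_hat) / n)
    standard_error = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, standard_error

# Hypothetical sample: the allele is seen 30 times among 100 sampled alleles.
p_hat, se = allele_frequency_estimate(30, 100)
print(f"p_hat = {p_hat:.3f}, standard error = {se:.4f}")
```

Because the error shrinks as 1/sqrt(n), halving it requires four times as many sampled alleles.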
Simplest model: Hardy-Weinberg
Studying dynamics of the frequencies of two alleles A/a of a gene
Assume:
Diploid organisms
Sexual Reproduction
Non-overlapping generations
Random mating
Males and females have the same allele frequencies
Large population, No migration
No mutations, no selection on the alleles under study
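Hardy-Weinberg equilibrium:
Allele frequencies: $P(A) = p$, $P(a) = q$
After one round of random mating (non-overlapping generations), the genotype frequencies are:
$P(AA) = p^2$
$P(Aa) = 2pq$
$P(aa) = q^2$
[Figure: pools of AA, aa, Aa, aA genotypes before and after random mating]
With the model assumptions, equilibrium is reached within one generation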
Testing Hardy-Weinberg using chi-square statistics
HW oversimplifies everything, but it can be used as a baseline to test whether interesting evolution is going on for some allele
A classical example is the blood group genotypes M/N (Sanger 1975). This genotype determines the expression of a polysaccharide on red blood cell surfaces, so the genotypes were quantifiable before the genomic era:
Genotype   Observed   HW expected
MM         298        294.3
MN         489        496
NN         213        209.3
Expected counts follow from $P(AA) = p^2$, $P(Aa) = 2pq$, $P(aa) = q^2$
$\chi^2 = \sum (obs - exp)^2 / exp$
Chi-square significance can be computed from the chi-square distribution with df degrees of freedom.
Here: df = #classes - #parameters - 1 = 3 (MN/NN/MM) - 1 (p) - 1 = 1
$\chi^2 = 0.22$
Recombination and linkage
Assume two loci have alleles A1, A2 and B1, B2, with allele frequencies p1, p2 and q1, q2
Linkage equilibrium:
$P(A_1B_1) = p_1q_1$
$P(A_1B_2) = p_1q_2$
$P(A_2B_1) = p_2q_1$
$P(A_2B_2) = p_2q_2$
Only double heterozygotes (A1B1/A2B2 or A1B2/A2B1) allow recombination to change gamete frequencies:
for example, A1B1/A2B2 yields the recombinant gametes A1B2 and A2B1
The recombination fraction r: the proportion of recombinant gametes generated from a double heterozygote
For different chromosomes: r = 0.5
For the same chromosome, r is a function of the distance and possibly other factors
Linkage disequilibrium (LD)
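$P_{11} = P(A_1B_1)$, $P_{12} = P(A_1B_2)$, $P_{21} = P(A_2B_1)$, $P_{22} = P(A_2B_2)$
[Figure: an A1B1 gamete in the next generation comes either from an A1B1 chromosome with no recombination (probability 1-r) or from recombination between any A1- and any -B1 chromosome (probability r)]
Next generation:
$P_{11}' = (1-r)P_{11} + r q_1 p_1$
$P_{11}' - q_1p_1 = (1-r)(P_{11} - p_1q_1)$
Define the linkage disequilibrium parameter D as:
$D = P_{11} - p_1q_1$
$D_n = (1-r)D_{n-1} = (1-r)^n D_0$
Equivalently, $D = P_{11}P_{22} - P_{12}P_{21}$
[Figure: D as a function of generation for r = 0.05, r = 0.2 and r = 0.5]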
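The geometric decay of D is easy to tabulate. This is a small sketch of the recurrence $D_n = (1-r)^n D_0$ for the three recombination fractions named in the figure. The starting value D0 = 0.25 and the function names are illustrative assumptions.

```python
import math

def ld_decay(d0, r, generations):
    """Return [D_0, D_1, ..., D_generations] under D_n = (1 - r)^n * D_0."""
    return [d0 * (1.0 - r) ** n for n in range(generations + 1)]

def ld_half_life(r):
    """Number of generations for D to drop to half of its starting value."""
    return math.log(0.5) / math.log(1.0 - r)

for r in (0.05, 0.2, 0.5):
    trajectory = ld_decay(0.25, r, 10)
    print(f"r = {r}: D after 10 generations = {trajectory[-1]:.4f}, "
          f"half-life = {ld_half_life(r):.1f} generations")
```

Even unlinked loci (r = 0.5) only halve D each generation, while tightly linked loci can stay in disequilibrium for many generations.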
Linkage disequilibrium (LD) - example
Blood group genotypes M/N and S/s. Both loci are in Hardy-Weinberg equilibrium
For M/N: p1 = 0.5425, p2 = 0.4575
For S/s: q1 = 0.3080, q2 = 0.6920
Haplotype   Observed   Expected if unlinked
MS          474        334.2
Ms          611        750.8
NS          142        281.8
Ns          773        633.2
$\chi^2 = \sum (obs - exp)^2 / exp = 184.7$
Linkage equilibrium highly unlikely!
$D = P_{11}P_{22} - P_{12}P_{21} = 0.07$
Sources of Linkage disequilibrium
LD in original population that was not stabilized due to low r
Genetic coadaptation: regions of the genome that are not subject to
recombination (for example, inverted chromosomal fragments)
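Admixture of populations with different allele frequencies:
Population 1 ($D = 0$): $P_{11} = 0.0025$, $P_{12} = 0.0475$, $P_{21} = 0.0475$, $P_{22} = 0.9025$
Population 2 ($D = 0$): $P_{11} = 0.9025$, $P_{12} = 0.0475$, $P_{21} = 0.0475$, $P_{22} = 0.0025$
Equal mixture: $P_{11} = 0.4525$, $P_{12} = 0.0475$, $P_{21} = 0.0475$, $P_{22} = 0.4525$
$D = 0.2025$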
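The admixture effect can be checked directly from the haplotype frequencies, using $D = P_{11}P_{22} - P_{12}P_{21}$. In this sketch the function names and the 50/50 mixing fraction are assumptions chosen to match the example.

```python
def ld_coefficient(p11, p12, p21, p22):
    """Linkage disequilibrium D = P11*P22 - P12*P21."""
    return p11 * p22 - p12 * p21

def mix(pop_a, pop_b, fraction_a=0.5):
    """Haplotype frequencies of a mixture containing fraction_a of pop_a."""
    return tuple(fraction_a * a + (1.0 - fraction_a) * b
                 for a, b in zip(pop_a, pop_b))

# Haplotype frequencies (P11, P12, P21, P22) of the two source populations
pop1 = (0.0025, 0.0475, 0.0475, 0.9025)
pop2 = (0.9025, 0.0475, 0.0475, 0.0025)
mixed = mix(pop1, pop2)

print("D in population 1:", round(ld_coefficient(*pop1), 6))
print("D in population 2:", round(ld_coefficient(*pop2), 6))
print("D in the mixture: ", round(ld_coefficient(*mixed), 6))
```

Each source population is in linkage equilibrium on its own, yet the mixture shows strong LD purely because the allele frequencies differ between the sources.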
Population substructure
The HW theory assumes that populations mate randomly
We mentioned that species are supposed to be isolated genetically, but even inside a species, the flow of information is never uniform
Subpopulation structure results in lower heterozygosity than expected under random mating
This is because (different) alleles become fixed in different subpopulations
We can compute the average heterozygosity predicted by HWE from allele frequencies: H = 2pq
HS – in each subpopulation, use the allele frequency to compute the HWE heterozygosity, then average
HR – in each region, use the allele frequency to compute the HWE heterozygosity, then take a weighted average
HT – for the entire population, use the allele frequency to compute the HWE heterozygosity
Wright's fixation index F
Comparing one level in the hierarchy to another
Provides an indication of the level of genetic differentiation in the population
$F_{SR} = (H_R - H_S) / H_R$
$F_{RT} = (H_T - H_R) / H_T$
0<F<1, F<0.05 is considered quite low, F>0.25 is considered very high
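A sketch of the hierarchical computation follows, using the standard definitions $F_{SR} = (H_R - H_S)/H_R$ and $F_{RT} = (H_T - H_R)/H_T$. The allele frequencies are made up for illustration, and all subpopulations are assumed to be the same size, so every average is weighted only by the number of subpopulations.

```python
def heterozygosity(p):
    """HWE heterozygosity H = 2pq for allele frequency p."""
    return 2.0 * p * (1.0 - p)

def f_statistics(regions):
    """regions: list of regions, each a list of subpopulation allele frequencies.

    Assumes equal subpopulation sizes (an illustrative simplification).
    Returns (H_S, H_R, H_T, F_SR, F_RT).
    """
    all_freqs = [p for region in regions for p in region]
    n = len(all_freqs)
    h_s = sum(heterozygosity(p) for p in all_freqs) / n
    h_r = sum(len(region) * heterozygosity(sum(region) / len(region))
              for region in regions) / n
    h_t = heterozygosity(sum(all_freqs) / n)
    return h_s, h_r, h_t, (h_r - h_s) / h_r, (h_t - h_r) / h_t

# Hypothetical data: two regions with two subpopulations each
h_s, h_r, h_t, f_sr, f_rt = f_statistics([[0.1, 0.3], [0.7, 0.9]])
print(f"H_S = {h_s:.3f}, H_R = {h_r:.3f}, H_T = {h_t:.3f}")
print(f"F_SR = {f_sr:.4f}, F_RT = {f_rt:.4f}")
```

In this toy data set most of the differentiation lies between regions rather than within them, which shows up as F_RT being much larger than F_SR.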
Population substructure (Dobzhansky and Epling 1942)
Frequency of the recessive allele (blue flower color) in “desert snow” flowers (Linanthus parryae)
[Figure: map of 30 sampled subpopulations grouped into regions, each labeled with its allele frequency. Values in extraction order: 0.717, 0.000, 0.032, 0.573, 0.005, 0.657, 0.000, 0.000, 0.009, 0.302, 0.000, 0.007, 0.004, 0.000, 0.000, 0.504, 0.002, 0.008, 0.005, 0.126, 0.010, 0.068, 0.000, 0.339, 0.000, 0.106, 0.224, 0.000, 0.014, 0.411]
Heterozygosities marked on the figure: H = 0.4995, H = 0.0272, H = 0.3062
$F_{SR} = 0.1589$
$F_{RT} = 0.3299$
More significant difference among regions than inside them
Each point represents ~4000 plants over 30 square miles of the Mohave desert
Inbreeding
A population with inbreeding will undergo a reduction in heterozygosity
For example, self-fertilization in plants
The inbreeding coefficient: $F = (H_0 - H_I) / H_0$
H0 – the random mating heterozygosity
HI – the observed (inbreeding) heterozygosity
In fact F is identical to the fixation index F and can be interpreted as the probability that two alleles are identical by descent (autozygous)
The increase in rare-allele homozygosity in an inbred population is frequently detrimental
Regular mating schemes in the lab and field: selfing, sib-mating, backcrossing to a single individual from a random-bred strain
Assortative mating:
positive (height in humans)
negative (cases in plants)
The HapMap project
1 million SNPs (single nucleotide polymorphisms)
4 populations:
30 trios (parents/child) from Nigeria (Yoruba - YRI)
30 trios (parents/child) from Utah (CEU)
45 Han Chinese (Beijing)
44 Japanese (Tokyo)
Haplotyping – each SNP/individual
Not just determining heterozygosity/homozygosity – haplotyping completely resolves the genotypes (phasing)
Because of linkage, a partial SNP map largely determines all other SNPs!
The idea is that a group of “tag SNPs” can be used to represent all genetic variation in the human population.
This is extremely important in association studies that look for the genetic causes of disease.
Correlation on SNPs between populations
Recombination rates in the human population: LD blocks
Recombination rates in the human population
Recombination rates are highly non uniform – with major effects on genome structure!
Mutations
Simplest model: assume two alleles A and a, with mutation probabilities:
$\Pr(A \to a) = \mu$
$\Pr(a \to A) = \nu$
If the process runs long enough, it converges to a stationary distribution:
$\Pr(A) = \nu / (\mu + \nu)$
Populations are however finite, and this creates random genetic drift
A single allele copy (frequency 1/2N) has a significant chance of being eliminated by sampling, even in one generation:
$(1 - \frac{1}{2N})^{2N} \approx 1/e$
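Both claims on this slide can be checked numerically. In the sketch below, mu denotes the A to a mutation probability and nu the reverse one (symbol names assumed here), and the rates are deliberately large so that convergence is fast. The second function evaluates the one-generation loss probability of a single allele copy.

```python
import math

def stationary_frequency_A(mu, nu):
    """Stationary frequency of A when A->a occurs at rate mu and a->A at rate nu."""
    return nu / (mu + nu)

def iterate_mutation(p0, mu, nu, generations):
    """Deterministic recurrence p' = p(1 - mu) + (1 - p)nu, starting from p0."""
    p = p0
    for _ in range(generations):
        p = p * (1.0 - mu) + (1.0 - p) * nu
    return p

def loss_probability_one_generation(N):
    """Chance that none of 2N sampled gametes carries a given single-copy allele."""
    return (1.0 - 1.0 / (2 * N)) ** (2 * N)

# Illustrative (unrealistically high) mutation rates
print(iterate_mutation(1.0, 0.01, 0.02, 2000), stationary_frequency_A(0.01, 0.02))
print(loss_probability_one_generation(1000), 1.0 / math.e)
```

With realistic mutation rates the approach to the stationary distribution takes on the order of 1/(mu + nu) generations, which is why drift usually dominates on shorter time scales.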
Drift
Figure 7.4
Experiments with drifting fly populations: 107 Drosophila melanogaster populations. Each originally consisted of 16 brown eyes (bw) heterozygotes. At each generation, 8 males and 8 females were selected at random from the progeny of the previous generation.
The bars show the distribution of allele frequencies in the 107 populations
Drift, fixation, and the neutral theory
If sampling is random, the chance of ultimate fixation of a new allele is $1/(2N)$,
simply because one allele must become fixed (and there are 2N to begin with).
According to the neutral theory, fixation of neutral alleles plays a major role in driving the divergence of populations.
This is in contrast to the selectionist view, which stresses adaptive evolution as the major force behind the fixation of new alleles.
The controversy around the neutral theory seems like something that belongs to the past, since it was heated around questions of evolution in protein-coding loci and densely coded genomes. Today we realize that genomic information is distributed in a way that should certainly allow neutral or almost neutral mutations considerable freedom in large parts of the genome.
There are still critically important questions about how well the neutrality assumption holds in different parts of the genome; we'll look at this question later.
Wright-Fisher model for genetic drift
[Figure: N individuals produce an infinite pool of gametes, from which the N individuals of the next generation are sampled]
We follow the frequency (copy number f) of an allele in the population, until fixation (f=2N) or loss (f=0)
We can model the copy number as a Markov process with transition probabilities:
$T_{ij} = \binom{2N}{j} (\frac{i}{2N})^j (1 - \frac{i}{2N})^{2N-j}$
(sampling j alleles from a population of 2N alleles that contains i copies)
In a larger population the frequency changes more slowly (the variance of the binomial variable is pq/2N, so sampling does not change the frequency that much)
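A minimal sketch of the Markov chain: one function evaluates the binomial transition probability $T_{ij}$ and another simulates a single population until the allele is fixed or lost. The population size, starting copy number and random seed are illustrative choices.

```python
import math
import random

def transition_probability(i, j, N):
    """T_ij: probability of j copies next generation given i copies now (2N gametes)."""
    x = i / (2 * N)
    return math.comb(2 * N, j) * x ** j * (1.0 - x) ** (2 * N - j)

def simulate_until_absorption(i0, N, rng):
    """Resample 2N gametes each generation until loss (0) or fixation (2N).

    Returns (final_copy_number, generations_elapsed).
    """
    i, t = i0, 0
    while 0 < i < 2 * N:
        x = i / (2 * N)
        i = sum(1 for _ in range(2 * N) if rng.random() < x)
        t += 1
    return i, t

N = 10
row = [transition_probability(5, j, N) for j in range(2 * N + 1)]
print("row sum:", sum(row))
print("expected copies next generation:", sum(j * t for j, t in enumerate(row)))

final, generations = simulate_until_absorption(5, N, random.Random(1))
print("absorbed at", final, "after", generations, "generations")
```

The expected copy number is unchanged from one generation to the next, so drift changes frequencies without any directional bias. Only the variance drives the allele to fixation or loss.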
Diffusion approximation and Kimura’s solution
Fisher, and then Kimura, approximated the drift process using a diffusion equation.
$\phi(x, t)$: the density of populations with frequency in x..x+dx at time t
$J(x, t)$: the flux of probability at time t and frequency x
The change in the density equals the difference between the fluxes J(x,t) and J(x+dx,t); taking dx to the limit we have:
$\frac{\partial}{\partial t}\phi(x,t) = -\frac{\partial}{\partial x}J(x,t)$
If M(x) is the mean change in allele frequency when the frequency is x, and V(x) is the variance of that change, then the probability flux equals:
$J(x,t) = M(x)\phi(x,t) - \frac{1}{2}\frac{\partial}{\partial x}[V(x)\phi(x,t)]$
Substituting gives the heat diffusion / Fokker-Planck / Kolmogorov forward equation:
$\frac{\partial}{\partial t}\phi(x,t) = -\frac{\partial}{\partial x}[M(x)\phi(x,t)] + \frac{1}{2}\frac{\partial^2}{\partial x^2}[V(x)\phi(x,t)]$
For pure drift, $M = 0$ and $V(x) = x(1-x)/(2N)$:
$\frac{\partial}{\partial t}\phi(x,t) = \frac{1}{4N}\frac{\partial^2}{\partial x^2}[x(1-x)\phi(x,t)]$
Changes in allele frequencies, Fisher-Wright model
After about 4N generations, just 10% of the cases are not fixed and the distribution
becomes flat.
Absorption time and Time to fixation
According to Kimura's solution, the mean time to fixation of an allele with initial frequency p, conditional on it not being lost, is:
$\hat{t}_1(p) = -\frac{4N}{p}(1-p)\log(1-p)$
The mean time to allele loss is (the fixation time of the complementary event):
$\hat{t}_0(p) = -\frac{4N}{1-p}\,p\log(p)$
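The two expressions can be evaluated for a new mutation, which starts at frequency p = 1/(2N). This sketch uses natural logarithms, and the population size is an illustrative choice.

```python
import math

def mean_fixation_time(p, N):
    """Mean generations to fixation, conditional on fixation, from frequency p."""
    return -4.0 * N * (1.0 - p) * math.log1p(-p) / p

def mean_loss_time(p, N):
    """Mean generations to loss, conditional on loss, from frequency p."""
    return -4.0 * N * p * math.log(p) / (1.0 - p)

N = 1000
p_new = 1.0 / (2 * N)   # a single new copy among 2N alleles
print(f"time to fixation: {mean_fixation_time(p_new, N):.1f} generations (4N = {4 * N})")
print(f"time to loss:     {mean_loss_time(p_new, N):.1f} generations")
```

A new neutral allele that does fix takes about 4N generations to do so, while one that is lost typically disappears within a few generations, roughly 2 ln(2N).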
Effective population size
4N generations looks like a huge number (in a population of billions!)
But in fact, the Wright-Fisher model (like the Hardy-Weinberg model) is based on many unrealistic assumptions, including random mating: any two individuals can mate
The effective population size is defined as the size of an idealized population for which the predicted dynamics of changes in allele frequency are similar to the observed ones
For each measurable statistic of population dynamics, a different effective population size can be computed
For example, the expected variance in allele frequency is expressed as:
$V(p_{t+1}) = p_t(1-p_t)/(2N)$
But we can use the same formula to define the effective population size given the variance:
$V(p_{t+1}) = p_t(1-p_t)/(2N_e)$
Effective population size: changing populations
If the population size changes over time, the dynamics are governed by the harmonic mean of the sizes:
$N_e = t / (\frac{1}{N_0} + \frac{1}{N_1} + \dots + \frac{1}{N_{t-1}})$
So the effective population size is dominated by the size of the smallest bottleneck
Bottlenecks can occur during migration, environmental stress, isolation
Such effects greatly decrease heterozygosity (founder effect, for example Tay-Sachs in Ashkenazim)
Bottlenecks can accelerate fixation of neutral or even deleterious mutations, as we shall see later.
Human effective population size over the last 2 My is estimated at around 10,000 (due to bottlenecks).
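To see how strongly a single bottleneck dominates, the sketch below compares the harmonic mean with the arithmetic mean for a made-up size history. The sizes and the function name are illustrative.

```python
def effective_size_over_time(sizes):
    """Harmonic mean of per-generation population sizes N_0 .. N_{t-1}."""
    return len(sizes) / sum(1.0 / n for n in sizes)

# Hypothetical history: nine generations at 1000 and one bottleneck generation at 10
sizes = [1000] * 9 + [10]
n_e = effective_size_over_time(sizes)
arithmetic_mean = sum(sizes) / len(sizes)
print(f"arithmetic mean = {arithmetic_mean:.0f}, effective size = {n_e:.1f}")
```

One generation at size 10 pulls the effective size down to about a tenth of the arithmetic mean.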
Effective population size: unequal sex ratio, and sex
chromosomes
If there are more females than males, or fewer males participate in reproduction, then the effective population size will be smaller than the actual size:
$N_a = N_m + N_f$
$N_e = \frac{4N_mN_f}{N_m + N_f}$
(any combination of alleles from a male and a female)
So if there are 10 times more females than males in the population, the effective population size is 4*x*10x/(11x) = 40x/11 ≈ 3.6x, much less than the size of the population (11x).
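The sex-ratio formula is quick to evaluate. This sketch uses illustrative counts of 100 males and 1000 females, matching the 1:10 ratio discussed above.

```python
def effective_size_sex_ratio(n_males, n_females):
    """N_e = 4 * N_m * N_f / (N_m + N_f) for unequal numbers of breeding males and females."""
    return 4.0 * n_males * n_females / (n_males + n_females)

skewed = effective_size_sex_ratio(100, 1000)     # actual size 1100
balanced = effective_size_sex_ratio(550, 550)    # same actual size, equal sexes
print(f"skewed: N_e = {skewed:.1f}, balanced: N_e = {balanced:.1f}")
```

With equal numbers of males and females the formula returns the actual population size, so the reduction comes entirely from the skew.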
Another example is the X chromosome, which is present in only one copy in males:
$p = \frac{1}{3}p_m + \frac{2}{3}p_f$
$Var(p) = \frac{1}{9}(\frac{p_mq_m}{N_m}) + \frac{4}{9}(\frac{p_fq_f}{2N_f})$
$Var(p) = pq(\frac{1}{9N_m} + \frac{4}{18N_f}) = pq / (2 \cdot \frac{9N_mN_f}{4N_m + 2N_f})$
$N_e = \frac{9N_mN_f}{4N_m + 2N_f}$
Testing neutrality
The drift process has clear dynamics. We are usually interested in these dynamics as a baseline for testing hypotheses on non-neutral evolution
Such tests require predictions on the behavior of concrete statistics that we can measure from a population
For example, we can sequence alleles and count how many polymorphic sites exist in a gene and what their frequencies are.
We can also perform evolutionary comparisons among different sites; we will focus on these later in the course.
[Figure: non-neutral population dynamics; slow evolution among species sp1 to sp5]
Infinite alleles model
Assuming a gene with multiple sites, we can think of the number of possible alleles as much larger than the population
In this model, the probability of generating the same mutation twice is considered 0
One can then ask how many distinct alleles we should observe given a neutral process and a certain mutation probability
Alternatively, one can ask what the probability of autozygosity F (identity by descent) will be:
$F_t = \frac{1}{2N}(1-\mu)^2 + (1 - \frac{1}{2N})(1-\mu)^2 F_{t-1}$
(picking up two autozygous alleles and not mutating them, or picking up the same allele twice)
Looking for the steady state and neglecting factors that depend on $\mu^2$, $\mu/N$:
$\hat{F} = \frac{1}{1 + 4N\mu}$
Because of our model, F is also the fraction of homozygous individuals:
$\hat{F} = \frac{1}{1 + 4N\mu} = \sum_i p_i^2$
Testing the infinite alleles model
$\hat{F} = \frac{1}{1 + 4N\mu}$, $\hat{H} = \frac{4N\mu}{1 + 4N\mu}$
The Ewens formula enables us to predict the number of alleles (k) we should observe when sampling n times from a population with $\theta = 4N\mu$, assuming the infinite alleles model:
$E(k) = 1 + \frac{\theta}{\theta + 1} + \frac{\theta}{\theta + 2} + \dots + \frac{\theta}{\theta + n - 1}$
The Chinese restaurant process
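A sketch of the Ewens expectation, together with a numerical inversion that recovers the compound parameter (written theta here, equal to 4N times the mutation rate) from an observed allele count k. The bisection solver and its bounds are choices made for this example, not part of the lecture.

```python
def expected_alleles(theta, n):
    """Ewens expectation E(k) = sum_{i=0}^{n-1} theta / (theta + i); the i = 0 term equals 1."""
    return sum(theta / (theta + i) for i in range(n))

def theta_from_alleles(k, n, lo=1e-9, hi=1e6, iterations=200):
    """Solve E(k) = k for theta by bisection (E(k) increases with theta); needs 1 < k < n."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if expected_alleles(mid, n) < k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical sample: 10 distinct alleles observed among n = 50 sampled genes
theta_hat = theta_from_alleles(10, 50)
f_hat = 1.0 / (1.0 + theta_hat)   # predicted homozygosity under the model
print(f"theta = {theta_hat:.3f}, predicted F = {f_hat:.3f}")
```

The predicted F can then be compared with the homozygosity actually observed in the sample, which is the basis of the neutrality test.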
Testing the infinite alleles model
We can estimate F from k (by finding $\theta$ from the E(k) formula):
$\hat{F} = \frac{1}{1 + 4N\mu} = \frac{1}{1 + \theta}$
We use this statistic to test whether a given gene behaves neutrally (or at least according to the model):
Figure 7.16, 7.17
Not quite neutral: VNTR locus in humans, observed (open columns) and Ewens-predicted allele counts.
Highly non-neutral: F computed from the number of Xdh alleles in 89 D. pseudoobscura lines (52 had a common allele, 8 were singletons), compared to a simulation assuming the infinite alleles model.
Infinite sites model
Instead of looking at an entire gene with many alleles, consider the many loci making up the gene and assume that these are changing slowly: most loci are monomorphic or dimorphic.
Probability of i mismatches between two random sequences:
$\Pr(S_2 = i) = (\frac{1}{\theta + 1})(\frac{\theta}{\theta + 1})^i$, $\theta = 4N\mu$
In particular, autozygosity:
$\Pr(S_2 = 0) = \frac{1}{\theta + 1} = F$
Just like we had for the infinite alleles model.
If we sample n alleles, the number of segregating sites S has the following mean and variance (assuming no intragenic recombination):
$E(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i}$
$V(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i} + \theta^2 \sum_{i=1}^{n-1} \frac{1}{i^2}$
So we can test neutrality by looking at the number of segregating sites in a certain sample.
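The moments of S are simple harmonic sums. The sketch below computes them and also inverts E(S) to estimate the compound parameter from an observed count, a step usually called Watterson's estimator. That inversion is an addition here, not something the slide states, and the sample numbers are made up.

```python
def harmonic_sums(n):
    """Return (sum of 1/i, sum of 1/i^2) for i = 1 .. n-1."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    return a1, a2

def segregating_sites_moments(theta, n):
    """Mean and variance of the number of segregating sites in a sample of n alleles."""
    a1, a2 = harmonic_sums(n)
    return theta * a1, theta * a1 + theta ** 2 * a2

def theta_from_segregating_sites(s_observed, n):
    """Estimate theta by inverting E(S) = theta * sum(1/i)."""
    a1, _ = harmonic_sums(n)
    return s_observed / a1

# Hypothetical sample: 25 segregating sites among n = 5 sequences
theta_hat = theta_from_segregating_sites(25, 5)
mean_s, var_s = segregating_sites_moments(theta_hat, 5)
print(f"theta = {theta_hat:.2f}, E(S) = {mean_s:.1f}, V(S) = {var_s:.1f}")
```

The variance grows with the square of theta, so a single locus gives a noisy estimate even under strict neutrality.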
Coalescent theory
Any set of individuals in a population is the consequence of a coalescence process: a common ancestor giving rise to multiple alleles through mutation, duplication and recombination.
Such models are in wide use for simulating populations
Applications for inferring selection/neutrality or other population dynamics are becoming feasible as more data becomes available.
A simple coalescent model looks at the gene tree of the k observed alleles
[Figure: gene tree running from Past (top) to Present (bottom); T_k is the time during which the sample has k lineages]
$E(T_2) = 2N$
$E(T_3) = 2N/3$
$E(T_4) = 2N/6$
$E(T_5) = 2N/10$
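The expected waiting times listed above all follow the pattern $E(T_k) = 2N / \binom{k}{2}$. The sketch below uses that pattern to compute each interval and the expected time back to the common ancestor of the whole sample. The function names and the value of N are illustrative.

```python
import math

def expected_coalescence_time(k, N):
    """E(T_k) = 2N / C(k, 2): expected generations while the sample has k lineages."""
    return 2.0 * N / math.comb(k, 2)

def expected_time_to_common_ancestor(k, N):
    """Sum of E(T_j) for j = 2 .. k; this equals 4N * (1 - 1/k)."""
    return sum(expected_coalescence_time(j, N) for j in range(2, k + 1))

N = 1000
for k in range(2, 6):
    print(f"E(T_{k}) = {expected_coalescence_time(k, N):.1f} generations")
print(f"expected tree height for 5 alleles: {expected_time_to_common_ancestor(5, N):.1f}")
```

The last interval, when only two lineages remain, accounts for more than half of the expected tree height, so adding more samples barely deepens the tree.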
Selection
Fitness: the relative reproductive success of an individual (or genome)
Fitness is only defined with respect to the current population.
Fitness is unlikely to remain constant in all conditions and environments
In the sampling model, the sampling probability is multiplied by a selection factor 1+s
Mutations can change fitness
A deleterious mutation decreases fitness. It would therefore be selected against. This process is called negative or purifying selection.
An advantageous or beneficial mutation increases fitness. It would therefore be subject to positive selection.
A neutral mutation is one that does not change fitness.
For haploid (mono-allelic) populations, selection acts directly on the fitness of an allele
For diploid organisms, we must define how the combination of alleles affects fitness.
Selection in haploid populations
Generation t-1:
Allele                    A            B
Frequency                 p_{t-1}      q_{t-1}
Relative fitness          w            1
Gametes after selection   p_{t-1}w     q_{t-1}
Generation t:
$p_t = \frac{p_{t-1}w}{p_{t-1}w + q_{t-1}}$, $q_t = \frac{q_{t-1}}{p_{t-1}w + q_{t-1}}$
Ratio as a function of time:
$\frac{p_t}{q_t} = w^t \frac{p_0}{q_0}$
The change in allele frequency: $\Delta p = \frac{pw}{pw + q} - p = \frac{pq(w-1)}{pw + q}$
Consider a continuous-time model:
$A'(t) = aA(t)$, $B'(t) = bB(t)$
$\frac{A(t)}{B(t)} = \frac{A(0)}{B(0)} e^{(a-b)t}$
Example (Hartl and Dykhuizen 81):
E. coli with two gnd alleles. One allele is beneficial for growth on gluconate.
A population of E. coli was tracked for 35 generations, evolving on two media; the observed frequencies were:
Gluconate: 0.455 → 0.898
Ribose: 0.594 → 0.587
For gluconate:
log(0.898/0.102) - log(0.455/0.545) = 35 log w
log10(w) = 0.0292, w = 1.0696
Compare to w = 0.999 in ribose.
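The fitness estimate from the gnd experiment can be reproduced from the ratio formula $p_t/q_t = w^t p_0/q_0$. The sketch below takes the gluconate starting frequency as 0.455, the value used in the worked logarithm, and then checks the estimate by iterating the one-generation recurrence. The function names are illustrative.

```python
import math

def relative_fitness(p_start, p_end, generations):
    """Solve p_t/q_t = w^t * p_0/q_0 for w, given start and end allele frequencies."""
    log_ratio_change = (math.log(p_end / (1.0 - p_end))
                        - math.log(p_start / (1.0 - p_start)))
    return math.exp(log_ratio_change / generations)

def next_frequency(p, w):
    """One generation of haploid selection: p' = pw / (pw + q)."""
    return p * w / (p * w + (1.0 - p))

w_gluconate = relative_fitness(0.455, 0.898, 35)
w_ribose = relative_fitness(0.594, 0.587, 35)
print(f"w (gluconate) = {w_gluconate:.4f}, w (ribose) = {w_ribose:.4f}")

# Consistency check: iterate the recurrence for 35 generations
p = 0.455
for _ in range(35):
    p = next_frequency(p, w_gluconate)
print(f"frequency after 35 generations on gluconate: {p:.3f}")
```

A per-generation advantage of about 7% is enough to take the allele from under half of the population to about 90% in 35 generations.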
Selection and allele frequency dynamics
Assume:
Genotype    AA      Aa      aa
Fitness     w11     w12     w22
Frequency   p^2     2pq     q^2    (Hardy-Weinberg!)
Change in frequency is given by:
$q_{t+1} = \frac{pqw_{12} + q^2w_{22}}{p^2w_{11} + 2pqw_{12} + q^2w_{22}}$
$\Delta q = \frac{pq[p(w_{12} - w_{11}) + q(w_{22} - w_{12})]}{p^2w_{11} + 2pqw_{12} + q^2w_{22}}$
In the case of codominance: $w_{11} = 1$, $w_{12} = 1 + s$, $w_{22} = 1 + 2s$
$\Delta q = \frac{spq}{1 + 2sqp + 2sq^2}$
For $s \to 0$: $\Delta q \approx spq$, so $\frac{dq}{dt} = sq(1 - q)$
$q_t = \frac{1}{1 + (\frac{1 - q_0}{q_0}) e^{-st}}$
Selection and fixation
An allele with a beneficial mutation will have an increased frequency in the gamete pool:
$p_0 = (1 + s)\frac{1}{2N}$
Its chance of immediate extinction is:
$(1 - \frac{1+s}{2N})^{2N} \approx e^{-(1+s)} \approx \frac{1-s}{e}$
This is a rather modest decrease, so even beneficial alleles are likely to be eliminated. For example, s=0.1 gives a loss probability of 0.333 compared to 0.368 for a neutral allele.
For a diploid population, if we assume the fitness of a heterozygote is 1+s and of a homozygote is 1+2s, it can be computed from the diffusion approximation that the overall fixation probability will be:
$p_f = \frac{1 - e^{-(2N_es/N)/(1+s)}}{1 - e^{-4N_es/(1+s)}}$
For $1 + s \approx 1$, $N_e/N \approx 1$:
$p_f \approx \frac{2sN_e/N}{1 - e^{-4N_es}} \approx 2sN_e/N$
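As a rough numerical check of these approximations, the sketch below evaluates the one-generation loss probability and the diffusion fixation probability for a new mutation. The parameter values are illustrative, and `expm1` is used only to keep the ratio accurate when s is tiny.

```python
import math

def immediate_loss_probability(s):
    """Approximate chance that a new allele with advantage s is lost in one generation."""
    return math.exp(-(1.0 + s))

def fixation_probability(s, N, Ne=None):
    """Diffusion-approximation fixation probability of a single new mutation.

    Ne defaults to N; the neutral case s = 0 is handled as the limit 1/(2N).
    """
    Ne = N if Ne is None else Ne
    if s == 0:
        return 1.0 / (2 * N)
    a = (2.0 * Ne * s / N) / (1.0 + s)
    b = 4.0 * Ne * s / (1.0 + s)
    return math.expm1(-a) / math.expm1(-b)

print(f"loss in one generation: neutral {immediate_loss_probability(0.0):.3f}, "
      f"s = 0.1 {immediate_loss_probability(0.1):.3f}")
print(f"fixation probability for s = 0.01, N = 1000: {fixation_probability(0.01, 1000):.4f}")
print(f"nearly neutral limit: {fixation_probability(1e-9, 1000):.6f} vs 1/(2N) = {1 / 2000:.6f}")
```

Even with a 1% advantage the mutation fixes only about 2% of the time, although that is roughly forty times better than the neutral 1/(2N) in this example.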
Selection and fixation
The fixation time for a neutral allele (conditional on fixation), as we said before, averages:
$\bar{t} = 4N$
With a selective advantage, the fixation time is approximated by:
$\bar{t} = (2/s)\ln(2N)$
Substitutions
Considering now the entire population, the rate of substitution at a locus equals the number of new mutations per generation times their fixation probability. In the neutral case, this is very simple:
$K = 2N\mu (\frac{1}{2N}) = \mu$
So neutral evolution is unaffected by the size of the population.
With a selective advantage, the substitution rate is approximated by:
$K = 2N\mu (2sN_e/N) = 4N_es\mu$
So evolution will be more efficient when the population is larger, the mutation rate is faster and selection is stronger. The parameter $4N_es$ describes the speed-up relative to the neutral rate.
Other types of selection
Over-dominance: heterozygotes are fitter, so there is a possibility for equilibrium in allele frequencies. There are few examples, but one famous case is resistance to malaria and sickle cell anemia in Africa
Frequency- and density-dependent selection: when the fitness depends on the frequency of the allele or the population size.
Fecundity selection: different reproductive potential for mating pairs.
Effects of heterogeneous environment: (overdominance?)
Different effects in males and females
Effects that apply directly to the haplotype: gametic selection/meiotic drive (e.g., killing the reproductive potential of your homologous chromosome)
Kin selection: origin of altruism?
Recombination and selection
Linkage and selection
Linkage interferes with the purging of deleterious mutations and reduces the efficiency of positive selection!
[Figure: linked chromosomes carrying beneficial and weakly deleterious mutations]
Selective sweep / hitchhiking effect / “genetic draft”
Hill-Robertson effect
Linkage and selection
The variance in allele frequency is used to define the effective population size:
$V(p) = p(1-p)/(2N_e)$
Simplistically, assume a neutral locus is evolving while selective sweeps affect a fully linked locus at rate $\delta$. A sweep will fix the neutral allele with probability p, and we further assume that the sweep happens instantly:
$V(p) = p(1-p)[\frac{1-\delta}{2N_e} + \delta] \approx \frac{p(1-p)}{2N_l}$, with $N_l = \frac{N_e}{1 + 2N_e\delta}$
This is very rough, but it demonstrates the basic intuition here: sweeps reduce the efficiency of selection in a way that can be quantified through a reduction in the effective population size.
$N_l = \frac{N_e}{1 + 2N_e\delta C}$
C – the average frequency of the neutral allele after the sweep