Florida State University Libraries
Electronic Theses, Treatises and Dissertations
The Graduate School
2007
Estimating Selection Coefficient Using the
Ancestral Selection Graph
Sonali Joshi
THE FLORIDA STATE UNIVERSITY
COLLEGE OF ARTS AND SCIENCES
ESTIMATING SELECTION COEFFICIENT USING THE ANCESTRAL
SELECTION GRAPH
By
SONALI JOSHI
A Thesis submitted to the
Department of Biological Science
in partial fulfillment of the
requirements for the degree of
Master of Science
Degree Awarded:
Summer Semester, 2007
The members of the Committee approve the Thesis of Sonali Joshi defended on June 1, 2007.
Peter Beerli
Professor Directing Thesis
Gavin Naylor
Committee Member
David Swofford
Committee Member
The Office of Graduate Studies has verified and approved the above named committee members.
ACKNOWLEDGEMENTS
I am extremely thankful to my major professor Dr. Peter Beerli for giving me an opportunity
to work with him. His support and patience were invaluable not only in my research but also in
coping with other areas of graduate school. He was always enthusiastic about this project, which
gave me great motivation.
Koffi Sampson took time out of his research to help me understand the intricate mathematics
of the Ancestral Selection Graph. I would also like to thank him for patiently pointing me in the
right direction during my research.
I would like to thank the members of my supervisory committee, Dr. Dave Swofford and
Dr. Gavin Naylor for their insightful comments during the course of this research work. Their
encouragement provided a tremendous boost for my confidence.
Finally, I would like to thank my family for their support and especially my husband, for without
his help I would not have been able to accomplish this.
Part of this research was supported by the joint NSF/NIGMS Mathematical Biology program
with NIH grant R01 GM 078985.
TABLE OF CONTENTS
List of Tables
List of Figures
Abstract
1. INTRODUCTION
1.1 The coalescent
1.2 Genealogies and selection
1.3 Summary Statistics
1.4 Detecting and measuring selection
2. MODELS
2.1 Deterministic models
2.2 Ancestral Selection Graph
2.3 Trees conditional on sample allele configuration
3. METHODS
3.1 Bayesian Methods
3.2 Simulation based Rejection methods
4. RESULTS
4.1 Implementation
4.2 Results
4.3 Testing the Estimator with Simulated Data
5. CONCLUSION
5.1 Future Work
REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES
2.1 Rules to resolve branching going up the Ancestral Selection Graph
2.2 Comparing the time to the Ultimate Ancestor (T_UA) of the Ancestral Selection Graph and the
corresponding time to the Most Recent Common Ancestor (T_MRCA) of the extracted genealogy
2.3 Branching events looking backward in time
4.1 Change in Standard Deviations in tree lengths with Selection Coefficient
LIST OF FIGURES
1.1 A realization of the coalescent genealogy for a sample of size 4
1.2 Variation in genealogies with the Coalescent. Figure courtesy of P. Beerli.
1.3 Sample genealogies - from Hudson & Kaplan [1] (a) A selective sweep (b) A typical neutral
genealogy (c) A balanced polymorphism
2.1 Example of an Ancestral Selection Graph, showing branching and coalescing events. The graph
starts with a sample of size 3 and is constructed until the time of the Ultimate Ancestor (UA)
2.2 Example of an Ancestral Selection Graph and the corresponding extracted genealogy.
2.3 Graph showing the increase in the average number of branching events with selection coefficient
2.4 Example of an Ancestral Selection Graph with a sample of known sample allele configuration
and two possible genealogies. Only the real branches are a part of the final genealogy.
4.1 Graph showing the average number of selected alleles (A2) that arise at the tips of the ASG,
in a simulated sample of size 10. The two curves show the dependence of the alleles at the tips on
the allele selected at the UA. The red curve shows the average number of A2 alleles when the UA
allele is A2 and the blue curve is for the UA allele A1.
4.2 Change in tree lengths with strength of selection
4.3 Change in mean tree length with σ and sample allele frequencies. Colors show the allele
frequencies of the favored allele (A2) at the tips, blue = 0, green = 0.5 and red = 1
4.4 Histograms showing the posteriors of θ using different summary statistics. The dotted vertical
lines show the true value of θ.
4.5 Comparing run lengths - The dotted line shows the true value of σ
4.6 Histograms showing the change in the posterior of σ with different priors for θ. The dotted
vertical lines show the true value of σ.
ABSTRACT
Detecting and measuring selection is of fundamental importance in many population genetic
studies. The objective of this study is to estimate the strength of selection acting at a locus by
studying polymorphism in neutral regions of DNA that are tightly linked to the given locus.
Kingman’s [2] Coalescent theory gives us a model to reconstruct the ancestral history of a
sample for neutrally evolving sites. The Ancestral Selection Graph (ASG), introduced by Krone and
Neuhauser [3], is an extension of the neutral coalescent process that incorporates selection. However,
its use has been limited due to computational issues in simulating the graph. In this study I use
the Ancestral Selection Graph conditional on the sample allele configuration, as described by Slade
[4], for efficient simulation of selected genealogies.
Inferences in the coalescent framework are often based on Likelihood or Bayesian approaches.
As population genetic models get more complex there is a growing interest in simulation based
Approximate Bayesian Methods. Tavaré et al. [5] first described an Approximate Bayesian
Computation (ABC) method for simulating observations from posterior distributions without the use of
Likelihoods. These methods use summary statistics to study the data and infer parameters.
An estimator of the Selection Coefficient is built using the Ancestral Selection Graph to simulate
selected genealogies and the Approximate Bayesian approach to estimate parameters. The effects of
the choice of summary statistics, priors and run lengths on the estimation of parameters are explored
using simulated data.
CHAPTER 1
INTRODUCTION
The aim of population genetics is to study within-species variation in genetic data and to infer
the parameters of the processes that could cause the variation. The parameters of interest include
the rates of mutation, migration, recombination and population growth. Earlier methods based on
summaries of data have given way to computationally oriented statistical methods. This change is
driven by the increase in computational power and the large quantities of data currently available.
Statistical analysis is usually based on finding a model that best explains the variation in the data
and then estimating the parameters of the model. Methods such as Maximum Likelihood and Bayesian
inference are widely used.
In this study I focus my attention on the genetic variation that is due to selection and develop
a method to measure the strength of selection. The current chapter introduces the coalescent, a
model to simulate neutral genealogies, changes in genetic variation expected due to selection, and
some summary statistics of data used to detect this difference.
Natural selection acts on individuals based on their phenotype or fitness but manifests itself
at the population level as a change in allele frequencies through time. This is fundamental to the
study of evolution. Detecting and estimating selection is a central problem in population genetics
and has been addressed at many levels. Parameters of interest with respect to selection are (1)
the time of the origin of the selected mutation (in case of a selective sweep), (2) the strength of
selection and (3) signatures of selection along a chromosome. Molecular data give us a finer view
of population dynamics, and estimating the strength of selection from DNA data is an
active area of research.
1.1 The coalescent
The neutral theory states that in a single population with no gene flow, most genetic variation is due
to neutrally evolving sites and only mutation and genetic drift shape the variation seen. Though
other evolutionary forces like recombination, migration and selection do shape genetic variation, the
neutral theory serves as a null model for expected genetic variation in a given sample.
Most modern statistical population genetics is based on the Coalescent theory described by
Kingman [2, 6]. The main assumption of the theory is that the sites under consideration
are evolving neutrally in a single population, with only genetic drift and mutation leading to the
current polymorphisms. It describes the genealogical history of a sample of k genes in a population
of constant size N. The genealogy is reconstructed backward in time until the most recent common
ancestor of the sample is found, at time T_MRCA. While k lineages remain, the waiting time to the
next coalescent event is exponentially distributed with mean 2/(k(k − 1)). Time is often measured
in units of 2N generations. The number of lineages reduces by one each time two randomly chosen
lineages coalesce. In the limit N → ∞, the probability that more than two lineages coalesce at the
same time is negligible. Figure 1.1 shows a coalescent genealogy for a sample of k = 4.
Figure 1.1: A realization of the coalescent genealogy for a sample of size 4 (figure labels: t1, t2,
t3, TMRCA; time axis directed toward the past)
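The waiting-time description above can be illustrated with a minimal sketch. It assumes time is measured in coalescent units, so that the waiting time while j lineages remain is exponential with rate j(j − 1)/2, and it tracks only the times, not the tree topology. The function names and the choice of 20000 replicates are illustrative assumptions, not part of the thesis implementation. Summing the mean waiting times gives an expected T_MRCA of 2(1 − 1/k), which the simulation should approach.

```python
import random

def simulate_coalescent_times(k, rng):
    """Return (waiting_times, t_mrca, total_length) for a sample of k genes.

    Time is in coalescent units: while j lineages remain, the waiting time
    is exponential with rate j(j-1)/2, i.e. mean 2/(j(j-1)).
    """
    waiting_times = []
    total_length = 0.0
    for j in range(k, 1, -1):
        rate = j * (j - 1) / 2.0
        t = rng.expovariate(rate)
        waiting_times.append(t)
        total_length += j * t  # j branches exist during this interval
    return waiting_times, sum(waiting_times), total_length

def expected_tmrca(k):
    # Sum of the mean waiting times; telescopes to 2 * (1 - 1/k)
    return sum(2.0 / (j * (j - 1)) for j in range(2, k + 1))

rng = random.Random(1)  # fixed seed so the sketch is reproducible
k = 4
reps = 20000
mean_tmrca = sum(simulate_coalescent_times(k, rng)[1] for _ in range(reps)) / reps
print("expected T_MRCA:", expected_tmrca(k), "simulated mean:", mean_tmrca)
```

The spread of the simulated T_MRCA values around this mean is large, which is the variation among neutral genealogies that the text refers to.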
The coalescent is based on two insights [7]. The first is that the mutation process can be separated
from the genealogy process, because neutral sites do not affect reproductive success.
Hence we can first create the genealogy as described above and then superimpose mutations on it.
Secondly, the neutrality assumption also allows us to create the ancestry of only a sample of genes
without worrying about the whole population. In short, genetic drift is modeled by the genealogy
and neutral mutations are modeled by first creating the genealogy and then adding mutations to
the tree. Refer to Nordborg [7] for an in-depth review of the coalescent.
The coalescent is thus an efficient tool for simulating genealogies, generating data, and studying
the expected variation in polymorphisms under neutral assumptions. It allows us to develop
algorithms for computer simulations of population samples under various models, to study these
distributions and to estimate likelihoods of the population parameters. Figure 1.2 shows the typical
variation expected in genealogies and hence in genetic data for a given set of parameters. This
predicts that a large variation is expected in neutral data due to random events or chance.
Figure 1.2: Variation in genealogies with the Coalescent. Figure courtesy of P. Beerli. (axes: time
to MRCA in 10^3 generations; frequency in units of 10^-6)
As long as we can keep the genealogical and the mutational processes separate, the coalescent
can be extended to incorporate other forces such as recombination, population growth, population
subdivision and migration. The effects of selection can be included in the coalescent only if they
are so strong that the frequencies of different genetic backgrounds can be approximated as changing
deterministically. In these cases the allele frequencies are assumed known through the ancestral
generations and the coalescent is modeled for the allelic classes, assuming no selection within them.
But because reproductive success depends on allele type when we consider selection, it is difficult
to incorporate selection in the coalescent framework in general. Chapter 2 describes the Ancestral
Selection Graph, an extension of the coalescent with selection that does not require knowledge of
the allele frequencies in the ancestral generations. It modifies the coalescent tree building process
to include branching events, thus forming a network or graph. The final genealogy is extracted from
the graph following a set of simple rules that, in effect, give the selected allele an advantage.
1.2 Genealogies and selection
Even though a random sample of individuals from a population is taken for a study, the individuals
are ultimately related by a genealogy. Gene trees or genealogies represent the history of a sample of
genes from a population. Patterns of variation in DNA are shaped by the genealogical history of
the samples. Though we may never know the true gene trees, we can say something about the
expected variation in a sample if we know the distribution of trees under various population models.
Kingman’s coalescent gives us a statistical model for gene trees under neutrality. This is essentially
a null model, and changes in genealogy can be brought about by migration, bottlenecks, population
growth, population substructure, selection and other forces.
Since we know the distribution of genealogies under the neutral assumptions using the coalescent,
a departure from this expectation can be seen as evidence that some other force has shaped
the variation. Selection can be detected similarly, as it is expected to change the genealogies. To
allow comparison with neutral expectations, it is customary to look at neutral regions linked to
the selected sites when detecting and estimating selection.
Differentiating between different causes of molecular variation
Since forces such as migration, recombination and selection all act simultaneously in any
population, it is difficult to tease apart their separate effects, and any analysis depends heavily on
our prior beliefs. However, while bottlenecks and migration should affect the whole genome in the
same way, selection acts on a specific part of the genome.
Estimating selection is further complicated by the fact that weak selection might not be
distinguishable from the action of drift. An allele can also increase in frequency in a population by
chance rather than by selection. Even neutral nucleotide diversity has a large variance and hence
might be difficult to distinguish from the effects of weak selection. Recombination may further
reduce these effects in linked neutral variation. Kim and Stephan [8] describe a statistical method
to test the significance of a local reduction in variation caused by a hitchhiking event, to
differentiate it from a reduction due to drift.
Theory suggests that selection would change the genealogy and any linked neutral variation
in a certain expected way. Figure 1.3 shows some changes that selection is expected to cause as
compared to tree (b), which is a neutral genealogy.
• Directional selection or Selective sweep (a) - The tree shows a selective sweep event. A
favorable allele originates in a population at the time from which all branches emerge and sweeps
through to fixation. It carries with it linked neutral sites (hitchhiking) and effectively shortens
the genealogy.
• Background or Purifying selection - occurs when deleterious mutations are purged by
selection. This removes many linked mutations and again shortens the genealogy [9].
• Balancing selection (c) - maintains variation in a population, as shown by tree (c), and
effectively increases the length of the gene trees.
Figure 1.3: Sample genealogies - from Hudson & Kaplan [1] (a) A selective sweep (b) A typical
neutral genealogy (c) A balanced polymorphism
1.3 Summary Statistics
Statistics of the DNA data of a sample can capture changes in the tree length or topology.
These can easily be compared to the neutral expectations, since we know the distributions of the
gene trees under neutrality. The following examples of commonly used summary statistics refer to
Figure 1.3.
• Number of Segregating Sites S - is a count of all polymorphic sites in a sample. Any change
in the total length of the coalescent tree would be reflected in the number of variable sites in
the data. For example, a locus hitchhiking with a selected site undergoing a selective sweep
as in tree (a) will have a lower S compared to the neutral expectation.
• Average Pairwise Differences, Π - is calculated by taking all pairs of individuals in a sample
and computing the average number of differences between them. Any change in branch lengths
or topology would be reflected in the average pairwise differences. Tree (c) with balancing
selection shows that two clusters of haplotypes are maintained in the population, increasing
the value of Π.
• Site frequency spectrum - For n sample sequences, the site frequency spectrum is a random
vector of size n − 1 whose i-th entry is the number of sites at which exactly i individuals in
the sample carry the mutation. In other words, it counts the mutations found in a single
individual, in two individuals, and so on. A starlike tree (a), for example, will have more
singleton mutations. It reflects any change in topology of the gene tree.
• Number of distinct haplotypes H - is a count of distinct haplotypes and is a measure of
Linkage Disequilibrium. A hitchhiking event could lead to strong Linkage Disequilibrium.
Summary statistics are an effective way to characterize a dataset. They can also be used to
estimate certain population parameters accurately.
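The four statistics listed above can be computed directly from aligned sequences. The following is a minimal sketch, assuming the data are given as equal-length strings over '0' and '1' with '1' marking the mutant state at a site; this encoding, the function name and the toy sample are assumptions made for illustration only.

```python
from itertools import combinations

def summary_statistics(haplotypes):
    """Compute S, Pi, the site frequency spectrum and H.

    `haplotypes` is a list of equal-length strings over {'0', '1'},
    where '1' marks the mutant state at a site (an assumed encoding).
    """
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    mutant_counts = [sum(h[i] == "1" for h in haplotypes) for i in range(n_sites)]
    # Segregating sites: mutant count strictly between 0 and n
    S = sum(1 for c in mutant_counts if 0 < c < n)
    # Average number of differences over all pairs of sequences
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # sfs[i-1] = number of sites where exactly i sequences carry the mutation
    sfs = [0] * (n - 1)
    for c in mutant_counts:
        if 0 < c < n:
            sfs[c - 1] += 1
    # Number of distinct haplotypes
    H = len(set(haplotypes))
    return S, pi, sfs, H

sample = ["0000", "1000", "1100", "1101"]  # toy data, not from the thesis
print(summary_statistics(sample))
```

For the toy sample, three of the four sites are polymorphic and each frequency class holds one site, so the spectrum is flat; a starlike genealogy would instead concentrate sites in the first entry.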
1.4 Detecting and measuring selection
The first estimators of population parameters were based on summaries of data. The most common
among them is Watterson’s [10] estimator of θ, which is based on the number of segregating
sites in a dataset, corrected for the sample size. If k is the number of segregating sites and n is the
sample size, then the estimate of θ is given by
$$\theta = \frac{k}{a}, \qquad \text{where } a = 1 + \frac{1}{2} + \cdots + \frac{1}{n-1}.$$
Tajima’s [11] estimator of θ is based on the average pairwise differences. If Πn is the average
number of pairwise differences in a sample of size n, then
$$\theta = \Pi_n$$
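As a small worked example of the two estimators, the sketch below evaluates them for hypothetical values (k = 3 segregating sites, n = 4 sequences, Πn = 10/6). The function names are illustrative assumptions.

```python
def harmonic_a(n):
    # a = 1 + 1/2 + ... + 1/(n-1)
    return sum(1.0 / i for i in range(1, n))

def watterson_theta(k, n):
    """Estimate of theta from k segregating sites in a sample of size n."""
    return k / harmonic_a(n)

def tajima_theta(pi_n):
    """Estimate of theta from the average number of pairwise differences."""
    return pi_n

# Hypothetical numbers chosen for illustration only
print(watterson_theta(3, 4), tajima_theta(10 / 6))
```

With n = 4 the correction is a = 11/6, so the estimate based on segregating sites is 18/11, close to but not equal to the estimate based on pairwise differences.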
Historically the focus has been on detecting the presence or signature of selection from molecular
data rather than measuring the strength of selection.
Various tests that measure any departure from the estimated values under neutrality were
developed to detect the presence or absence of selection.
Tajima’s D
Tajima’s D [12] makes use of the two estimates of θ described above to construct a test for
selection. The assumption here is that in the neutral case the two estimates will be the same. Since
k ignores the frequency of mutations, it is strongly affected by the occurrence of rare alleles,
whereas the calculation of Πn takes the frequency into account. Tajima’s D is roughly defined as
the difference between the two estimates, scaled appropriately:
$$D = \theta(\Pi_n) - \theta(k)$$
A selective sweep reduces variation and k recovers faster than Π, giving a negative value of D.
Balancing selection maintains variation by forming clusters of haplotypes and increases Π, leading
to a positive D.
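The text leaves the scaling unspecified. The sketch below fills it in with the standard normalization of Tajima's statistic, in which the difference is divided by an estimate of its standard deviation built from constants that depend only on n. The constants are the commonly published ones and are an addition here, not taken from this thesis; returning 0 when k = 0 is a convention chosen for this sketch.

```python
import math

def tajimas_d(pi_n, k, n):
    """Tajima's D from Pi_n, k segregating sites and sample size n."""
    if k == 0:
        return 0.0  # convention for this sketch; D is undefined without variation
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference of the two theta estimates over its estimated standard deviation
    return (pi_n - k / a1) / math.sqrt(e1 * k + e2 * k * (k - 1))

# Hypothetical inputs: the same k with low and high pairwise diversity
print(tajimas_d(2.0, 10, 10), tajimas_d(5.0, 10, 10))
```

A value of Πn below k/a gives a negative D, the direction expected after a sweep, and a value above it gives a positive D, the direction expected under balancing selection.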
McDonald-Kreitman test
Because synonymous variation is assumed not to be under selection, selection can be detected from
the difference between the synonymous and nonsynonymous variation in a protein coding region.
McDonald and Kreitman [13] proposed a statistical test based on this observation, comparing the
number of fixed sites to the number of polymorphic sites. The ratio of amino acid replacement to
silent polymorphisms within a population is compared to the ratio of replacement to silent fixed
differences between the population and an outgroup.
Under neutrality the ratios are expected to be equal. Any departure is evidence for selection.
Allele Frequencies
Variation in allele frequencies among populations can be an effect of natural selection. All neutral
polymorphisms that are due to mutation and drift are expected to show similar variation among
geographically isolated populations. F_ST based statistics are used to detect this variation.
Linkage Disequilibrium
Directional selection may cause linkage disequilibrium (LD), as a selected allele that sweeps to
fixation takes along with it other linked sites, thereby reducing variation. The reduction in
variation in the linked regions depends on the strength of selection, the recombination rate and the
time elapsed since the selection event. Strong LD may be associated with alleles if selection has
recently increased their frequencies.
While presence of selection can be shown in many cases, the strength of selection cannot be
measured by these methods. For example, Verrelli et al. [14] use various summaries of data to
find evidence of balancing selection in human G6PD data. The aim of this work is to build an
estimator to measure the strength of selection. Computational methods based on summaries of
data are gaining popularity due to their ease of implementation. Przeworski [15] developed a
summary-based rejection method to estimate the time since fixation of a beneficial allele.
Briefly, the goals of this work are:
1. To study the Ancestral Selection Graph, a coalescent based model with selection, to get the
expected variation in genealogies due to selection.
2. To study inference using simulation and summary statistic based Approximate Bayesian methods.
3. To build a selection strength estimator based on the Ancestral Selection Graph using the
technique of Approximate Bayesian Computation [16].
The following chapters describe the model and methods in further detail.
CHAPTER 2
MODELS
Coalescent theory gives us an elegant way to simulate genealogies for a sample of genes. Models that
incorporate selection into the coalescent framework fall into two categories, deterministic and
stochastic, based on the way they treat the change in allele frequencies through generations.
Deterministic methods assume that selection is strong and that allele frequencies through the
ancestral generations are known. The following sections discuss these models in further detail.
2.1 Deterministic models
This approach assumes that strong selection overcomes the effects of drift and that allele frequencies
change deterministically. These methods first calculate the allele frequencies
through the ancestral generations deterministically. Consider selection acting at a locus with two
alleles A1 and A2. The population of each allele type through the ancestral generations can be
thought of as an ‘allelic class’. Within each ‘allelic class’ there is no selection and hence we can
use the standard coalescent to model genealogies for each class. Under this approach balancing
selection reduces to a two-population problem in which mutation between classes plays the role of
migration [17]. A selective sweep can be thought of as increasing the frequency of a selected allele
deterministically and can be modeled as population growth [18].
Slatkin [19] described an importance sampling method to simulate the history of a selected allele
by simulating backwards from the current frequency until the allele is lost. Hudson and Kaplan
[20] studied the effects of overdominant selection, assuming that the frequencies of the two alleles
are fixed. Here, the problem reduces to a two-island model, with mutation and recombination playing
the role of migration.
These methods perform poorly when selection is weak or of comparable strength to drift.
Though the mathematical expressions for the branch length changes in a sample of size two have
been derived, it is difficult to generate the entire genealogy efficiently.
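To make the idea of a deterministic allele-frequency history concrete, here is a minimal sketch. It assumes the simplest haploid model in continuous time, dx/dt = s x (1 − x), whose solution is the logistic curve; this particular model, the function names and the numerical values are assumptions for illustration and are not claimed to be the formulation used in [17] to [20].

```python
import math

def deterministic_frequency(x0, s, t):
    """Frequency of the favored allele t generations after it was at x0.

    Logistic solution of dx/dt = s * x * (1 - x); negative t looks
    backward in time.
    """
    growth = math.exp(s * t)
    return x0 * growth / (1.0 - x0 + x0 * growth)

def trajectory_backward(x_now, s, generations):
    """Frequencies at 0, 1, ..., `generations` generations before the present."""
    return [deterministic_frequency(x_now, s, -g) for g in range(generations + 1)]

# Hypothetical sweep: allele currently at 90 percent, s = 0.05
history = trajectory_backward(0.9, 0.05, 200)
print(history[0], history[100], history[200])
```

Given such a trajectory, the sizes of the two allelic classes are known at every point in the past, which is what allows the standard coalescent to be applied within each class.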
2.2 Ancestral Selection Graph
The stochastic method is the Ancestral Selection Graph proposed by Krone and Neuhauser [3]. The
Ancestral Selection Graph (ASG) is an extension of the neutral coalescent process that incorporates
selection. This was the first attempt to simulate selected genealogies from the coalescent without
assuming that allele frequencies are known through the ancestral generations.
The main problem in incorporating selection is that we lose the simplicity of the coalescent
assumption of neutrally evolving sites and we cannot separate the mutation process from the
genealogy building process. To overcome this problem, a network or graph called the Ancestral
Selection Graph is constructed for the sample of size k with the true genealogy embedded in it. In
addition to coalescent events the graph contains branching events, which allow genes to take
alternative paths as the graph evolves. This separates the mutation process from the graph building
process. The final genealogy is then extracted from the graph based on certain rules which give the
favored allele a selective advantage. The following sections describe the model.
Figure 2.1: Example of an Ancestral Selection Graph, showing branching and coalescing events.
The graph starts with a sample of size 3 and is constructed until the time of the Ultimate Ancestor
(UA) (figure labels: times T0 to T4, a branching event, a coalescent event, the UA; time axis
directed toward the past)
This model is described in detail in Krone and Neuhauser [3] and Neuhauser and Krone [21].
The model as described here assumes a haploid population evolving according to the Moran model.
A single locus with two alleles, A1 and A2, is under selection, with A2 having a selective advantage
s. The population size N is large and constant. Mutation is symmetric between the alleles with
rate µ_N.
Population model
The model used to describe the theory is the continuous time Moran model with selection and
mutation. However, as the results are obtained using a diffusion approximation, the theory applies
equally to the discrete time Wright-Fisher model in the limit N → ∞. The Moran model is described as
a continuous time Markov process in which a random individual is chosen to reproduce at a given
rate and the offspring replaces an individual, thereby keeping the population size constant. Thus,
this is essentially a birth-death process which can be analyzed using Markov chain theory.
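The birth-death structure can be sketched as a sequence of discrete events. The version below is one common way to write a Moran step with selection and symmetric mutation, with the parent chosen in proportion to fitness; it is an illustrative assumption and does not reproduce the exact continuous time rates used in [3]. The names s (fitness advantage of A2) and u (per-birth mutation probability) are chosen for this sketch.

```python
import random

def moran_step(i, N, s, u, rng):
    """One birth-death event; i is the current number of A2 copies."""
    # Parent chosen with probability proportional to fitness (A2: 1+s, A1: 1)
    w2 = i * (1.0 + s)
    w1 = N - i
    parent_is_a2 = rng.random() < w2 / (w1 + w2)
    # Symmetric mutation flips the offspring's type with probability u
    child_is_a2 = parent_is_a2 != (rng.random() < u)
    # The offspring replaces a uniformly chosen individual, so N stays constant
    dead_is_a2 = rng.random() < i / N
    return i + int(child_is_a2) - int(dead_is_a2)

def simulate_moran(N, s, u, events, rng, i0=None):
    """Run `events` birth-death events and return the final A2 count."""
    i = N // 2 if i0 is None else i0
    for _ in range(events):
        i = moran_step(i, N, s, u, rng)
    return i

rng = random.Random(7)
print(simulate_moran(N=100, s=0.05, u=0.01, events=50000, rng=rng))
```

Because each event changes the count of A2 by at most one, the count of A2 copies is a birth-death chain on 0, 1, ..., N.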
Diffusion approximation
The above Markov process is studied as a diffusion process by rescaling the parameters. The
diffusion theory holds in the limit as the population size tends to infinity. Time is rescaled and
measured in units of N generations. The other parameters are rescaled as
$$N s_N \to \sigma \quad \text{and} \quad N \mu_N \to \theta \quad \text{as } N \to \infty,$$
where σ and θ are positive and finite. The unique stationary distribution of this process has density
f(x), which is a special case of Wright’s formula,
$$f(x) = K x^{\theta - 1} (1 - x)^{\theta - 1} e^{-\sigma x},$$
where x is the frequency of either allele, and K is a normalizing constant.
The model thus applies to large populations where mutation rate and selection coefficient are
small. The standard diffusion holds for any σ and θ, as long as these remain finite as N tends to
infinity. An excellent review about the limitations of these assumptions is given by Wakeley [22].
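The normalizing constant K has no simple closed form for general σ, but it is easy to approximate numerically. The sketch below uses a midpoint rule on (0, 1); the function names and the number of steps are assumptions made for illustration. The midpoint rule is adequate for θ ≥ 1; for θ < 1 the density is unbounded at the endpoints and a more careful quadrature would be needed.

```python
import math

def wright_density_unnormalized(x, theta, sigma):
    # x^(theta-1) * (1-x)^(theta-1) * exp(-sigma * x)
    return x ** (theta - 1.0) * (1.0 - x) ** (theta - 1.0) * math.exp(-sigma * x)

def normalizing_constant(theta, sigma, steps=100000):
    """K such that the density integrates to 1 (midpoint rule on (0, 1))."""
    h = 1.0 / steps
    total = sum(wright_density_unnormalized((j + 0.5) * h, theta, sigma)
                for j in range(steps))
    return 1.0 / (total * h)

def mean_frequency(theta, sigma, steps=100000):
    """Mean of x under the stationary density."""
    K = normalizing_constant(theta, sigma, steps)
    h = 1.0 / steps
    return K * h * sum((j + 0.5) * h * wright_density_unnormalized((j + 0.5) * h, theta, sigma)
                       for j in range(steps))

print(normalizing_constant(2.0, 0.0), mean_frequency(2.0, 0.0), mean_frequency(2.0, 5.0))
```

With σ = 0 the density is symmetric and the mean is 1/2; with σ > 0 the factor e^{−σx} shifts mass toward small x, so the mean falls below 1/2.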
Constructing the genealogy using the Ancestral Selection Graph
The behavior of the Ancestral Selection Graph (ASG) evolving according to the Moran model is
derived using the biased voter model. For a full description of this model refer to [3].
The ASG construction involves branching events along with the coalescent events going
backward in time. A coalescent event reduces the number of lineages in the graph by one while a
branching event increases it by one. The branching event does not add an ancestor to
the real genealogy, but just serves to give an alternative path the genealogy could take so as to
give a selective advantage to the favored allele. This advantage is achieved by the way branching
is resolved while extracting the true genealogy from the graph.
The basic steps in getting the genealogy can then be summarized as:
• Constructing the graph - If the number of lineages at a given time is k, the rates at which
coalescing or branching events happen are
Coalescing : k → k − 1 at rate k(k − 1)/2
Branching : k → k + 1 at rate kσ/2
The two branches emerging out of a branching node are labeled as incoming and
continuing respectively. The resulting times are scaled by N to get branch lengths in units of
generations. Starting from the current sample of k alleles this process is run backwards
in time until the number of lineages reaches 1 for the first time. This ancestor is referred to as
the Ultimate Ancestor (UA), which is not necessarily the most recent common ancestor (MRCA)
of the sample.
Figure 2.2: Example of an Ancestral Selection Graph and the corresponding extracted genealogy.
(figure labels: times T0 to T4, allele types A1 and A2 at the nodes, a mutation, the selected
branch, the MRCA and the UA; time axis directed toward the past)
• Adding mutations - the branching makes the mutation process independent of constructing
the graph. Starting from the UA, at time T_UA, we add mutations according to a Poisson process
with rate θ/2 independently along each branch of the graph. The type of the UA is then chosen
according to the stationary distribution of the process and is evolved up to the tips depending on
the location of the mutations.
• Extracting the true genealogy - We then travel up the graph to extract the genealogy. At each
branching node we face a choice of which branch to keep in the real genealogy. This is where
the selective advantage of the allele matters. If both alleles coming towards the branching
point are the same then we keep either branch with equal probability, but if
the alleles are different then the branch containing the selected allele A2 is always kept in the
final genealogy. Table 2.1 shows the rules to resolve branching. This leads to an increase in
the frequency of the favored allele.
Table 2.1: Rules to resolve branching going up the Ancestral Selection Graph
Branch 1    Branch 2    Keep Branch
A1          A1          Either 1 or 2
A1          A2          Branch 2
A2          A1          Branch 1
A2          A2          Either 1 or 2
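The rules in Table 2.1 can be written as a small function. This is only an illustrative sketch: the function name, the string labels for the alleles and the use of Python's random module are assumptions made here, not part of the original implementation.

```python
import random

FAVORED = "A2"  # the selected allele in the two-allele model (label is an assumption)

def resolve_branching(branch1, branch2, rng=random):
    """Return 1 or 2: the branch kept in the genealogy, following Table 2.1."""
    if branch1 == branch2:
        # Same allele on both branches: either branch, with equal probability.
        return rng.choice((1, 2))
    # Different alleles: the branch carrying the favored allele is always kept.
    return 1 if branch1 == FAVORED else 2
```

For example, resolve_branching("A1", "A2") always returns 2, while resolve_branching("A1", "A1") returns 1 or 2 at random.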
It is easy to see that if σ = 0 there is no branching and the method reduces to the
standard coalescent. The branching rate depends on the strength of selection, while the coalescence rate
depends only on the number of ancestors present at any given time. Notice that as
σ increases, the total rate at which events happen also increases, which means that the number of
branching events increases. This does not necessarily mean that the total tree length will increase,
as the branch lengths decrease correspondingly.
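As a worked step, using only the coalescing and branching rates listed earlier and the standard result for competing exponential waiting times, the probability that the next event backward in time is a branching, and the expected waiting time to that event, are

```latex
\[
P(\text{branching} \mid k)
  = \frac{k\sigma/2}{k(k-1)/2 + k\sigma/2}
  = \frac{\sigma}{k-1+\sigma},
\qquad
E[\text{waiting time} \mid k] = \frac{2}{k(k-1+\sigma)} .
\]
```

With σ = 0 the branching probability is zero for every k ≥ 2, which is the reduction to the coalescent noted above; for fixed k, increasing σ raises the branching probability and shortens the expected waiting time.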
Extending the Ancestral Selection Graph
Neuhauser and Krone [21] describe a number of models to which the ASG can be extended, amongst
which are the K-allele model, the infinitely many alleles model, infinitely many sites model and
diploid models. These models give more complicated graph structures than the haploid model
discussed here. For example, the rules for resolving branchings change for K-allele models while
for the diploid model a branching event leads to three branches instead of two.
Drawbacks of the Ancestral Selection Graph
One apparent drawback of this method is that, as the number of branches increases with σ, the
graph becomes very large. As the entire graph is required for the embedded genealogy to be
extracted, it needs to be stored in computer memory. Even for low σ values the graph becomes
too large to store. This limits the use of the ASG to very low σ values, typically
σ ≤ 12. Figure 2.3 shows how the average number of branching events increases with the strength
of selection.
Figure 2.3: Graph showing the increase in the average number of branching events with selection
coefficient. [Figure not reproduced: x-axis sigma, 0 to 14; y-axis number of branching events, 0 to 1e+06.]
Secondly, as the strength of selection increases, the branching rate also increases, and hence it takes
a very long time for the number of ancestors to reach one. We can see from table 2.2 how the time to the
UA increases with the strength of selection, though the time to the MRCA decreases or remains the
same. This shows that the real gene tree coalesces long before the UA is reached and a lot of time
is spent reaching the UA, which is just an artifact of the ASG.
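The growth of the time to the UA can be explored with a short simulation that tracks only the number of ancestors, using the coalescing and branching rates given earlier. This is a sketch under stated assumptions: the function name, the sample size of 10, the number of replicates and the seed are choices made here, the graph itself is not stored, and the output is not expected to reproduce the values in table 2.2.

```python
import random

def simulate_time_to_ua(k, sigma, rng):
    """Run the ancestor-count process backward until one ancestor remains.

    Returns (time to the UA, number of branching events, number of coalescences).
    """
    t = 0.0
    branchings = 0
    coalescences = 0
    while k > 1:
        coal_rate = k * (k - 1) / 2.0    # k -> k - 1
        branch_rate = k * sigma / 2.0    # k -> k + 1
        total = coal_rate + branch_rate
        t += rng.expovariate(total)      # waiting time to the next event
        if rng.random() < branch_rate / total:
            k += 1
            branchings += 1
        else:
            k -= 1
            coalescences += 1
    return t, branchings, coalescences

rng = random.Random(1)
for sigma in (0.0, 1.0, 2.0, 4.0):
    runs = [simulate_time_to_ua(10, sigma, rng) for _ in range(500)]
    mean_t = sum(run[0] for run in runs) / len(runs)
    mean_b = sum(run[1] for run in runs) / len(runs)
    print(f"sigma={sigma}: mean T_UA={mean_t:.3f}, mean branchings={mean_b:.2f}")
```

Because every branching must eventually be undone by an extra coalescence, the number of coalescences always equals the number of branchings plus k − 1, which is a convenient sanity check on such a simulator.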
Finally, we have no control over the distribution of allele frequencies at the tips of the final
genealogy. The following section describes attempts to overcome these problems and to get a more
efficient algorithm to generate trees using the ASG.
2.3 Trees conditional on sample allele configuration
Typically, in simulating selected genealogies using the ASG we have no control over the frequencies
of the alleles that arise at the tips, as these depend on the assumed ancestral allele at the UA and
the strength of selection. But often, in the population or sample under study, we know the allele
frequencies. We might have to reject a lot of trees if we wish to generate genealogies that explain
the sample at hand. Hence it might be more useful to draw trees conditional on the known sample
configuration.
Table 2.2: Comparing the time to the Ultimate Ancestor (T_UA) of the Ancestral Selection Graph
and the corresponding time to the Most Recent Common Ancestor (T_MRCA) of the extracted
genealogy
σ     Mean T_MRCA   Mean T_UA
0     2.5495        2.5495
1     2.6678        2.9734
2     2.5929        3.1277
3     2.5304        3.3006
4     2.4271        3.6272
5     2.3466        4.5880
6     2.3109        7.2350
7     2.1818        13.3422
8     2.0845        36.0303
9     2.0596        58.4277
10    2.0645        227.4843
One way to look at the extra branches in an ASG is to label them as real and virtual [21]. The
branches that will be a part of the final genealogy are called real and the extra branches produced
at the branching events are called virtual. If we know the allele types at the tips we can assign a
type to each branch going backward in time while constructing the ASG, without knowing
the type of the UA.
Table 2.3: Branching events looking backward in time
Branching allele    Emerging alleles    Advantage mattered?
A1                  A1, A1              No
A2                  A1, A2              Yes
A2                  A2, A2              No
While table 2.1 gives rules to resolve branching going up the graph, table 2.3 states the same
rules going down the graph as it is being constructed [4]. As an example, refer to figure 2.4, which shows
an ASG for a sample of three A1 alleles, with no mutation on the graph. While constructing the graph
backwards, it is straightforward to assign the type real or virtual to each branch at a branching event,
as we know exactly which allele emerges out of this node. In this case the allele is A1 and hence
the labelling of the branches is arbitrary. We can see from the figure that the genealogy extracted from
the ASG (a) can take one of two forms, either (b) or (c), with equal probability. Since selecting
either branch is consistent with an A1 allele at the tip, it does not matter which branch is kept in the
final genealogy. Here, as the decision of which branch to keep in the final genealogy is made at each
branching point, the construction of the ASG can end when the MRCA is reached (the number of
real ancestors reaches one).
Figure 2.4: Example of an Ancestral Selection Graph with a sample of known sample allele
configuration and two possible genealogies. Only the real branches are a part of the final genealogy.
[Figure not reproduced: panel (a) ASG, panel (b) Possible Genealogy - 1, panel (c) Possible Genealogy - 2; all tips are A1, times run from T0 to T4, and the branches at branching events are marked real or virtual.]
Based on this fact, Slade [4] [23] proposed modifications to the ASG algorithm to draw a
genealogy conditional on the given sample allele configuration. Starting from the known allele types
in the sample, the ASG is constructed going backward in time as before, but the final genealogy is
identified at the same time. This can be done only when the branches can be labelled as real or
virtual at each event, i.e., the events themselves need to be classified as real or virtual. Three kinds of events
are considered simultaneously: branching, coalescing and mutation. Going backward in time,
the rates of the three events are given by
Coalescing : k → k − 1 at rate k(k − 1)/2
Branching : k → k + 1 at rate kσ/2
Mutation : at rate θ/2
In the algorithm a count of real and virtual ancestors is kept at all times. At a coalescent event
only two similar alleles can coalesce, reducing the sample size by one. At a branching event, one
extra ancestor of the same type is created. The number of real ancestors in a coalescent cannot
increase, hence the extra branch created at the branching event is labelled as virtual and it is not
a part of the final genealogy. The other branch is then called real. At a mutation event the type of
allele changes from A1 to A2 and vice versa.
Thus the number of real ancestors eventually reaches one, at which point the MRCA
is obtained. This method has the advantage of reducing the computer memory problems, as the
entire graph need not be drawn out, but it might still take a long time to reach a sample size
of one, depending on the mutation rate and selection coefficient.
To construct the ASG conditional on the sample allele configuration we need to keep track
of the three events (branching, coalescing and mutation), the type of event (real or virtual) and the
type of allele (A1 or A2). The probabilities for each event can then be derived [4]. In the following
derivations, i and j refer to the allele types A1 and A2, r_i is the number of real ancestors of type i,
v_i is the number of virtual ancestors of type i and n = (n_1, n_2) is the initial sample size. Similarly,
r = (r_1, r_2) and v = (v_1, v_2) are the numbers of real and virtual ancestors of each type.
Probability of coalescence
An event is a real coalescence event (a part of the final genealogy) only if both the ancestors involved
are real.
Probability of real coalescence = P(Coalescence) * P(i or j) * P(two real)
\[
\frac{n(n-1)}{n(n-1+\sigma+\theta)} \cdot \frac{n_i-1}{n-1} \cdot \frac{r_i(r_i-1)}{n_i(n_i-1)}
= \frac{r_i(r_i-1)}{n_i(n-1+\theta+\sigma)}
\]
An event is a virtual coalescence if either or both ancestors are virtual. This is not reflected in the
genealogy, but just reduces the number of virtuals by 1.
Probability of virtual coalescence = P(Coalescence) * P(i or j) * P(two virtual or one of each)
\[
\frac{n(n-1)}{n(n-1+\sigma+\theta)} \cdot \frac{n_i-1}{n-1} \cdot \frac{v_i(v_i-1+2r_i)}{n_i(n_i-1)}
= \frac{v_i(v_i-1+2r_i)}{n_i(n-1+\theta+\sigma)}
\]
Probability of mutation
Probability of real mutation = P(Mutation) * P(i or j) * P(real)
\[
\frac{\theta}{n(n-1+\sigma+\theta)} \cdot \frac{n_i+1}{n} \cdot \frac{r_j}{n_j}
\]
Probability of virtual mutation = P(Mutation) * P(i or j) * P(virtual)
\[
\frac{\theta}{n(n-1+\sigma+\theta)} \cdot \frac{n_i+1}{n} \cdot \frac{v_j}{n_j}
\]
Probability of branching
Both real and virtual ancestors can branch, but the new branch created is always virtual as the
number of real ancestors cannot increase. Table 2.3 shows the two cases to be considered. In the first case,
both emerging alleles are of the same type, and a virtual allele of either type can be gained.
Probability of branching (gain of virtual A_i) = P(Branching) * P(i or j)
\[
\frac{\sigma}{n(n-1+\sigma+\theta)} \cdot \frac{n_i(n_i+1)}{n(n+1)}
\]
In the second case, an A1 allele is gained when an A2 allele branches.
Probability of branching (gain of virtual A1) = P(Branching) * P(A1)
\[
\frac{\sigma}{n(n-1+\sigma+\theta)} \cdot \frac{2n_2(n_1+1)}{n(n+1)}
\]
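Based on these probabilities, the normalizing factor for Monte Carlo simulation of genealogies is
given by equation 3.1 in [4].
\[
\begin{aligned}
h(r, v) ={}& \frac{1}{n-1+\theta+\sigma}
  \left[ \sum_{i \in \{1,2\},\, r_i > 0} (r_i - 1)
       + \sum_{i \in \{1,2\},\, v_i > 0} (v_i - 1 + 2r_i) \right] \\
&+ \frac{\theta}{n-1+\theta+\sigma}
  \left[ \sum_{i,j \in \{1,2\},\, r_j > 0,\, i \neq j} \frac{r_i + 1}{n}
       + \sum_{i,j \in \{1,2\},\, v_j > 0,\, i \neq j} \frac{v_i + 1}{n} \right] \\
&+ \frac{\sigma}{n-1+\theta+\sigma}
  \left[ \sum_{i \in \{1,2\}} \frac{n_i (v_i + 1)}{n(n+1)}
       + \frac{2 n_2 (v_1 + 1)}{n(n+1)} \right]
\end{aligned}
\]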
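The probabilities of branching, coalescing or mutation events are then given by
\[
\begin{aligned}
\text{Real coalescent: } r_i \to r_i - 1 \text{ at rate} &= \frac{r_i - 1}{h(r, v)\,(n-1+\theta+\sigma)} \\
\text{Virtual coalescent: } v_i \to v_i - 1 \text{ at rate} &= \frac{v_i - 1 + 2r_i}{h(r, v)\,(n-1+\theta+\sigma)} \\
\text{Real mutation: } r_i - 1,\ r_j + 1 \text{ at rate} &= \frac{\theta (r_i + 1)}{h(r, v)\,n\,(n-1+\theta+\sigma)} \\
\text{Virtual mutation: } v_i - 1,\ v_j + 1 \text{ at rate} &= \frac{\theta (v_i + 1)}{h(r, v)\,n\,(n-1+\theta+\sigma)} \\
\text{Branching: } v_i \to v_i + 1 \text{ at rate} &= \frac{\sigma \left( n_i (v_i + 1) + \delta_{i,1}\, 2 n_2 (v_1 + 1) \right)}{h(r, v)\,n(n+1)\,(n-1+\theta+\sigma)}
\end{aligned}
\]
where δ_{i,1} is the Kronecker delta, and i = 1, 2, i.e.,
\[
\delta_{ij} = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases}
\]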
The transition probabilities sum to one. The process is run backwards in time until the number of real
ancestors reaches one, which is the MRCA of the genealogy. While constructing the genealogy, only the
real branches and nodes are built as a structure in computer memory. A branching
event adds one virtual ancestor, which only serves to increase the number of ancestors for the purpose
of rate calculations and has no real significance in terms of the genealogy. Similarly, a coalescent
event of virtual type just decreases the number of virtual ancestors by one.
While constructing the graph, the probabilities of real or virtual coalescence are calculated only
if the number of ancestors is 2 or more. This prevents the number of ancestors from going negative.
With these modifications, it is not necessary to hold all the branches and nodes in computer
memory, and it is also not necessary to continue drawing the graph until the UA. Thus it is possible
to simulate genealogies for much higher σ values than was possible with the standard ASG.
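To illustrate how the normalizing factor h(r, v) turns the event rates into probabilities, the following sketch transcribes the five rates listed above and normalizes them. It is not the thesis implementation. It assumes that n_i = r_i + v_i and n = n_1 + n_2 are the current numbers of ancestors (the text does not state this explicitly), applies the conditions r_i > 0, v_i > 0, r_j > 0 and v_j > 0 as they appear in the sums of h(r, v), and deliberately leaves out the state update after an event. All function names are hypothetical.

```python
import random

def event_weights(r, v, theta, sigma):
    """Unnormalised event weights, transcribed from the rates listed above.

    r, v: dicts {1: count, 2: count} of real and virtual ancestors.
    ASSUMPTION: n_i = r_i + v_i and n = n_1 + n_2 (current ancestor counts).
    """
    n_i = {i: r[i] + v[i] for i in (1, 2)}
    n = n_i[1] + n_i[2]
    c = n - 1 + theta + sigma
    w = {}
    for i in (1, 2):
        j = 3 - i  # the other allele type
        if r[i] > 0:
            w[("real_coalescence", i)] = (r[i] - 1) / c
        if v[i] > 0:
            w[("virtual_coalescence", i)] = (v[i] - 1 + 2 * r[i]) / c
        if r[j] > 0:
            w[("real_mutation", i)] = theta * (r[i] + 1) / (n * c)
        if v[j] > 0:
            w[("virtual_mutation", i)] = theta * (v[i] + 1) / (n * c)
        # Kronecker delta term: only present for i = 1.
        extra = 2 * n_i[2] * (v[1] + 1) if i == 1 else 0
        w[("branching", i)] = sigma * (n_i[i] * (v[i] + 1) + extra) / (n * (n + 1) * c)
    return w

def event_probabilities(r, v, theta, sigma):
    """Divide each weight by their sum, which plays the role of h(r, v)."""
    w = event_weights(r, v, theta, sigma)
    h = sum(w.values())
    return {event: weight / h for event, weight in w.items()}

def draw_event(r, v, theta, sigma, rng):
    """Sample the next event label according to the normalised probabilities."""
    probs = event_probabilities(r, v, theta, sigma)
    events = list(probs)
    return rng.choices(events, weights=[probs[e] for e in events], k=1)[0]

probs = event_probabilities({1: 3, 2: 2}, {1: 1, 2: 0}, theta=1.0, sigma=5.0)
print(sum(probs.values()))
```

Because h(r, v) is by construction the sum of the unnormalised weights, the probabilities sum to one up to floating-point error, whatever the interpretation of n and n_i.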
CHAPTER 3
METHODS
In most model-based statistical analyses, the choice of a method depends on how easily the model
lends itself to calculating the likelihoods and how easy it is to traverse the parameter space.
Typically, these methods do not have analytical solutions and one needs to approximate the
likelihood. Commonly used methods are Monte Carlo (MC) approximations, Importance Sampling
(IS) and Markov Chain Monte Carlo (MCMC).
In the case of the Ancestral Selection Graph, traversing the graph and tree space can be
extremely difficult for MCMC based analysis, though there are some recent advances [24]. MC
might generate a large number of unsuitable trees.
It is much easier to generate trees using the Ancestral Selection Graph and simulate data
on them than to change graph and tree topologies as MCMC methods require, which makes the ASG
easy to use in an Approximate Bayesian method. Here we evaluate a selection estimator using the
Ancestral Selection Graph as a simulator of selected genealogies. The following sections describe
the motivation behind using the Approximate Bayesian Computation approach.
3.1 Bayesian Methods
Inferences in the coalescent framework are often based on Bayesian approaches. These methods give
the posterior distribution of the parameter of interest, θ. The background information is included
as a prior distribution π(θ). The posterior distribution f(θ|D) for a dataset D is calculated as
f(θ|D) ∝ f(D|θ)π(θ)
These methods have the advantage of allowing background information to be incorporated in the
model as priors; they also provide posterior probability distributions and integrate out nuisance parameters
[25] [26]. Typically, these inferences use MCMC methods, which can potentially give exact solutions
but can be time consuming and difficult to implement [16]. For a review of current trends in
statistical methods refer to Marjoram and Tavaré [27].
3.2 Simulation based Rejection methods
As datasets get larger and population genetic models more complex, there is a growing interest in
simpler simulation-based approximate methods. Tavaré et al. [5] first proposed a rejection sampling
method to approximate the posterior distribution. The data are replaced by a summary statistic S.
Then, instead of accepting a value of θ from the prior distribution with probability proportional to
its likelihood P(D|θ), they proposed accepting it with probability proportional to
P(S = s|θ), the probability of obtaining the observed value s of the summary statistic for the sampled value [16]. These methods are possible only
when P(S = s|θ) is easy to calculate.
Other variations on the same theme are proposed when such probabilities are difficult to
calculate. One way is to generate a value from the prior, simulate new data using these prior
values as parameters of our model and accept the prior if the simulated data equals the real data.
The accepted values of the prior give the posterior distribution. No likelihood calculations are
required. A rejection algorithm is the iteration of the following steps for a given set of data D.
• Step 1: Sample a value of the parameter θ from the prior.
• Step 2: Simulate data D′ using your model with the sampled parameter value of θ.
• Step 3: Accept the sampled value of θ if D = D′; otherwise discard it.
• Step 4: Return to step 1.
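The four steps above can be sketched on a deliberately simple toy problem. Everything specific here is an assumption chosen for illustration and has nothing to do with the ASG: the "data" are the number of successes in 10 Bernoulli trials, the parameter is the success probability p with a Uniform(0, 1) prior, and the function names and seed are hypothetical.

```python
import random

def simulate_successes(p, trials, rng):
    """Toy model: number of successes in `trials` Bernoulli(p) draws."""
    return sum(rng.random() < p for _ in range(trials))

def rejection_sample(observed, trials, n_accept, rng):
    """Collect n_accept parameter values whose simulated data equal the observed data."""
    accepted = []
    while len(accepted) < n_accept:
        p = rng.random()                                 # Step 1: draw from the prior
        simulated = simulate_successes(p, trials, rng)   # Step 2: simulate data
        if simulated == observed:                        # Step 3: exact match required
            accepted.append(p)
        # Step 4: return to step 1 (next loop iteration)
    return accepted

rng = random.Random(42)
posterior = rejection_sample(observed=7, trials=10, n_accept=500, rng=rng)
print(sum(posterior) / len(posterior))
```

For this toy model the exact posterior is a Beta(8, 4) distribution with mean 2/3, so the printed sample mean should lie close to that value; no likelihood is evaluated anywhere in the code.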
3.2.1 Approximate Bayesian Computation (ABC)
If the data are too complex, the acceptance rate of the above algorithm becomes very
low. Hence we approximate the data comparison by taking summary statistics of the data. Fu and
Li [28] suggested further improvements to the above algorithm, by calculating summary statistics
of both datasets and comparing them instead of the entire datasets.
Another level of approximation is that, instead of requiring an exact match of the simulated and observed
summary statistics, we accept the parameter if the two are sufficiently close [27]. The iterative
algorithm then becomes:
• Step 1: Calculate the summary statistic S for the given data.
• Step 2: Sample a value of the parameter θ from the prior.
• Step 3: Simulate data using your model with the sampled parameter value of θ and calculate
the summary statistic S′.
• Step 4: Accept the sampled value of θ if |S − S′| < ε, where ε is the margin of error or tolerance;
otherwise discard it.
• Step 5: Return to step 2.
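A minimal sketch of this loop is given below. As a stand-in for the ASG simulator it uses the neutral coalescent (σ = 0): the total tree length is built from exponential waiting times with rate k(k − 1)/2, and the number of segregating sites is drawn as a Poisson count with rate θ/2 per unit of branch length, which assumes every mutation creates a new segregating site. The function names, the uniform prior range and the tolerance are illustrative assumptions.

```python
import random

def tree_length(n, rng):
    """Total branch length of a neutral coalescent tree for n samples."""
    return sum(k * rng.expovariate(k * (k - 1) / 2.0) for k in range(2, n + 1))

def poisson(mean, rng):
    """Poisson draw obtained by counting unit-rate arrivals before time `mean`."""
    count = 0
    t = rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate_segregating_sites(theta, n, rng):
    """Summary statistic S' for one simulated dataset."""
    return poisson(theta / 2.0 * tree_length(n, rng), rng)

def abc_rejection(s_obs, n, prior_max, epsilon, n_draws, rng):
    """Steps 2 to 5: keep the sampled theta when |S - S'| < epsilon."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, prior_max)            # Step 2
        s_sim = simulate_segregating_sites(theta, n, rng)  # Step 3
        if abs(s_obs - s_sim) < epsilon:               # Step 4
            accepted.append(theta)
    return accepted

rng = random.Random(7)
accepted = abc_rejection(s_obs=5, n=10, prior_max=10.0, epsilon=1.5, n_draws=2000, rng=rng)
print(len(accepted), "values accepted out of 2000")
```

Step 1 corresponds to supplying s_obs, which is computed once from the observed data and never recomputed inside the loop.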
Since the data is approximated at many levels, the success of an ABC method critically depends
on the choices made for the following:
• Summary Statistic: The suitability(sufficiency) of the summary statistic needs to be evaluated
for the problem at hand. It should exploit the difference in data that is due to the parameter
you are trying to measure.
• Data Simulator: The speed of the method will depend on the efficiency of the simulator used
to generate data. Coalescent based models are well suited for this.
• Tolerance (ε): Tolerance needs to be chosen as a tradeoff between accuracy, computation time
and complexity.
• Prior distribution: Since this method does not explore the parameter space beyond the prior,
the true value of parameters needs to be in the range of the prior.
3.2.2 Sufficiency of statistics
In population genetics there is no single statistic that is sufficient for every problem. Certain statistics
may be adequate for a given parameter. For example, Tavaré et al. [5] found that their inference
of coalescence times using the statistic S (number of segregating sites) was quite close to that obtained
using other methods.
One way to overcome this problem is to use multiple summary statistics. Przeworski [15] used
multiple summary statistics to estimate the time since fixation of a beneficial allele. However, doing
so can reduce the acceptance rate unless the tolerance (ε) is increased.
3.2.3 Improving the ABC estimator
Beaumont et al. [16] proposed modifications to the rejection methods to make the approximation
insensitive to ε, allowing the use of multiple summary statistics. This is done by using local linear
regression and smooth weighting of summary statistics. Each accepted value is given a weight that
declines quadratically as a function of |S − S′|, and then weighted linear regression is used to adjust
the accepted values [26].
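The weighting and adjustment can be sketched for a single summary statistic as follows. This is an illustration of the idea rather than the procedure of [16] in full: the quadratically declining weight is written in an Epanechnikov-style form, the regression is an ordinary weighted least-squares line, and the function names and the bandwidth parameter delta are assumptions made here.

```python
def quadratic_weight(distance, delta):
    """Weight that declines quadratically with distance and is zero beyond delta."""
    if distance > delta:
        return 0.0
    return 1.0 - (distance / delta) ** 2

def regression_adjust(thetas, stats, s_obs, delta):
    """Adjust accepted values to the observed statistic with a weighted linear fit.

    thetas: accepted parameter values; stats: their simulated summary statistics.
    Returns (adjusted values, weights).
    """
    w = [quadratic_weight(abs(s - s_obs), delta) for s in stats]
    total = sum(w)
    mean_s = sum(wi * s for wi, s in zip(w, stats)) / total
    mean_t = sum(wi * t for wi, t in zip(w, thetas)) / total
    sxx = sum(wi * (s - mean_s) ** 2 for wi, s in zip(w, stats))
    sxy = sum(wi * (s - mean_s) * (t - mean_t) for wi, s, t in zip(w, stats, thetas))
    slope = sxy / sxx if sxx > 0 else 0.0
    # Shift each value along the fitted line to where it would sit at S' = S.
    adjusted = [t - slope * (s - s_obs) for t, s in zip(thetas, stats)]
    return adjusted, w

stats = [1.0, 2.0, 3.0, 4.0, 5.0]
thetas = [0.5 + 2.0 * s for s in stats]
adjusted, weights = regression_adjust(thetas, stats, s_obs=3.0, delta=10.0)
print(adjusted)
```

In this artificial example the parameter is an exact linear function of the statistic, so every adjusted value collapses to the value of the line at the observed statistic; with real simulations the adjustment only removes the locally linear part of the dependence.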
To overcome some problems with prior distributions being very different from the posterior,
Marjoram et al. [29] have proposed an MCMC method without the use of likelihoods that combines
the simplicity of rejection methods and the ability of MCMC to sample from areas of high posterior
probability [27].
3.2.4 Comparison to other methods
In contrast to MCMC-based analysis, Approximate Bayesian methods are easy to implement.
The stopping criterion can be predetermined (for example, based on the number of accepted values
required), eliminating the problem of assessing convergence as in MCMC. Since ABC generates independent
draws from the prior, the simulations can be run in parallel. However, ABC methods perform
poorly when the prior and posterior distributions are very different, as many draws from the prior are
rejected. MCMC methods give better results in this case.
Some recent studies have compared the method favorably to MCMC and other alternative
methods. Tallmon et al. [26] compared their estimator of effective population size (SummStat) to
three existing moment-based and likelihood estimators. They found that the Approximate Bayesian estimator
was the least biased over the full range of parameters investigated. Beaumont et al. [16] found that
though approximate methods are faster to implement and take less time for analysis, the MCMC-based
methods are always superior in accuracy. Nevertheless, ABC is an interesting approach for the initial
analysis of a new Bayesian method.
CHAPTER 4
RESULTS
4.1 Implementation
The Ancestral Selection Graph (ASG) based simulator for trees with selection conditional on the
sample configuration and the estimator based on Approximate Bayesian Computation (ABC) were
both written in the programming language C++.
The ASG simulates a neutral sequence that is assumed to be tightly linked (no recombination) to a
two-allele selected locus. The alleles are A1 and A2, with A2 having a selective advantage. The
program allows simultaneous estimation of two parameters, θ = Nµ and σ = Ns (µ is the mutation
rate, s is the selection coefficient and N the population size). The priors for both parameters are
uniform between minimum and maximum values input by the user. The program prints all
the sampled prior values and the respective summary statistics to a file. Analysis and estimation
of parameters are done using functions written in MATLAB. This allows for flexibility in estimating
parameters with different error tolerance levels and combinations of summary statistics without
having to run the simulation process each time.
The summary statistics selected are S, the number of segregating sites, and π, the average
number of pairwise differences. These statistics were chosen because they are easy to code and
commonly used in inference of selection by other methods. On average, the program takes
between 2 and 10 hours to sample 100,000 values from the priors and calculate summary statistics, depending
on the parameters θ and σ.
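For reference, both statistics are simple to compute from a set of aligned sequences. The sketch below is written in Python for brevity and is not the C++ code of the thesis; the function names and the three-sequence example are assumptions for illustration.

```python
from itertools import combinations

def segregating_sites(seqs):
    """S: number of alignment columns containing more than one distinct base."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))

def mean_pairwise_differences(seqs):
    """pi: average number of differing sites over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)

sample = ["AAAA", "AAAT", "AATT"]
print(segregating_sites(sample))          # 2
print(mean_pairwise_differences(sample))  # 1.3333333333333333
```

In the example, the third and fourth columns are polymorphic, giving S = 2, and the three pairs differ at 1, 2 and 1 sites, giving π = 4/3.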
The same program can also be used to simulate DNA datasets for combinations of σ, θ and
sample allele configurations.
4.2 Results
4.2.1 Ancestral Selection Graph
Change in frequencies with selection coefficient.
If the Ancestral Selection Graph is simulated as proposed by Krone and Neuhauser (1997), there
is no control over which allele frequencies arise at the tips. They depend on the choice of the Ultimate
Ancestor (UA) and the strength of selection. Figure 4.1 shows simulation results (θ = 1, sample size
= 10) of how the average number of selected alleles (A2) at the tips changes as the strength of
selection increases. At σ = 0 the types of the alleles at the tips depend strongly on the allele selected
at the UA (which is the same as the MRCA for σ = 0). As the strength of selection
increases, the allele A2 has an advantage and the frequency of A2 alleles at the tips
increases. Eventually it does not matter which allele was selected at the UA.
Figure 4.1: Graph showing the average number of selected alleles (A2) that arise at the tips of the
ASG, in a simulated sample of size 10. The two curves show the dependence of the alleles at the
tips on the allele selected at the UA. The red curve shows the average number of A2 alleles when
the UA allele is A2 and the blue curve is for the UA allele A1. [Figure not reproduced: x-axis Sigma, 0 to 10; y-axis A2 frequency, 3 to 10.]
Figure 4.2: Change in tree lengths with strength of selection. [Figure not reproduced: x-axis Sigma, 0 to 10; y-axis Tree Length, 1.2 to 2.6.]
4.2.2 Properties of trees conditional on sample size
Change in tree lengths with selection coefficient.
Figure 4.2 shows simulation results of the change in mean conditional tree length with the strength
of selection. These data were simulated using the conditional ASG for a sample of all selected A2 alleles
at the tips, allele A1 at the MRCA and θ = 1. As selection gets stronger it requires less time
for an A1 ancestor to reach a sample configuration of all A2 alleles at the tips.
Mutation-Selection balance
The change in tree lengths depends upon mutation rates, the selection coefficient and the sample
configuration. Figure 4.3 shows simulation results of the change in mean conditional tree length
with the strength of selection and sample allele frequencies. The graph (a) is for θ = 1 and it can
be seen that the mean lengths decrease in all cases, but more slowly when sample allele frequency
is 0.5. Since only like alleles can coalesce, some time is spent in waiting for a mutation to occur
for the final coalescence. This depends on the mutation rate. Graph (b) shows the opposite effect.
Here, as mutation rate is lower (θ = 0.01), the mean tree lengths actually increase with selection.
This is a form of mutation-selection balance.
The mean tree length for having all A2 alleles at the tips decreases more rapidly than that
for seeing all A1 alleles. Biologically, this makes sense because the favored A2 alleles will rise in
frequency rapidly and are expected to find their common ancestor sooner than A1 alleles.
Figure 4.3: Change in mean tree length with σ and sample allele frequencies. Colors show the allele
frequencies of the favored allele (A2) at the tips, blue = 0, green = 0.5 and red = 1. [Figure not reproduced: panel (a) θ = 1 and panel (b) θ = 0.01; x-axis Sigma, 0 to 10; y-axis Tree Length.]
Variance in tree lengths with the coalescent and selection
Table 4.1 shows the standard deviations of the expected time to the MRCA for θ = 0.01 and
for σ = 0 to 10. We can see that as σ increases, the variance in the expected tree lengths generally
increases. Also note from figure 4.3 that the mean lengths decrease or increase with σ,
depending on the allele frequencies of the sample. Hence the percentage error increases substantially
with σ. This variance will reduce the power to estimate the parameters.
4.3 Testing the Estimator with Simulated Data
Unless otherwise mentioned all data were simulated for a sample size of 10 and a sequence length
of 1000.
Choice of Summary Statistics
Different summary statistics capture information about different aspects of the data. The number of
segregating sites S is known to be a good summary for θ under neutrality, whereas the average number
of pairwise differences can represent the data more accurately if the tree has two deep branches at
the end, as shown by tree (c) in figure 1.3. Such trees are expected when θ is low and we have a
non-uniform sample at the tips.
Table 4.1: Change in Standard Deviations in tree lengths with Selection Coefficient
       A2 allele frequencies
σ      0       0.5     1
0      0.01    0.014   0.010
1      0.031   0.234   0.027
2      0.042   0.316   0.034
3      0.048   0.366   0.035
4      0.049   0.395   0.041
5      0.052   0.415   0.037
6      0.053   0.431   0.040
7      0.055   0.445   0.039
8      0.055   0.452   0.037
9      0.054   0.462   0.040
10     0.054   0.45    0.042
To test the performance of the summary statistics, data were simulated for θ = 0.01 and σ = 10.
The priors were uniform for θ = 0.0001 to 0.1 and σ = 0 to 50. The sample frequency was chosen
to be 0.5 for each allele. The data were first analyzed using only one summary statistic, S, and then
adding the other statistic, π. Figure 4.4 shows the histogram for the posterior of θ in each case.
The posteriors look very different, and the combination of the two statistics gives a result closer to the
true value in this case.
Comparing run lengths
The number of samples from the prior needed to get a reasonable number of accepted values depends
on the parameter space and the prior intervals. Figure 4.5 shows how the mean estimate of σ is
sensitive to the number of priors sampled. Data were simulated for θ = 1 and σ = 10. The criterion
was that at least 1000 priors should be accepted. The epsilon value was changed accordingly. Two
summary statistics were used, S and π.
In this case the estimate does not change much after about 50,000 sampled values from the
prior.
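Because the sampled values and their summary statistics are written to a file, the tolerance can be chosen after the simulations have finished. One simple way to meet a criterion such as "at least 1000 accepted values" is to set ε to the distance of the n-th closest simulation. The sketch below illustrates this; the function name and the representation of the draws as (value, distance) pairs are assumptions, and it is not the MATLAB code used in the thesis.

```python
def accept_closest(draws, n_accept):
    """Choose epsilon so that at least n_accept draws are accepted.

    draws: list of (parameter_value, distance) pairs, where distance measures how far
    the simulated summary statistics are from the observed ones.
    Returns (epsilon, accepted parameter values).
    """
    if n_accept < 1 or n_accept > len(draws):
        raise ValueError("n_accept must be between 1 and the number of draws")
    ranked = sorted(draws, key=lambda pair: pair[1])
    epsilon = ranked[n_accept - 1][1]   # distance of the n_accept-th closest draw
    accepted = [value for value, distance in draws if distance <= epsilon]
    return epsilon, accepted

draws = [(1.0, 0.5), (2.0, 0.1), (3.0, 0.9), (4.0, 0.3)]
epsilon, accepted = accept_closest(draws, 2)
print(epsilon, accepted)
```

Note that draws are accepted here when the distance is less than or equal to ε, so ties at the threshold can yield slightly more than n_accept values.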
Figure 4.4: Histograms showing the posteriors of θ using different summary statistics: (a) summary statistic S; (b) summary statistics S and π. The dotted
vertical lines show the true value of θ. [Figure not reproduced.]
Estimating θ and σ together
The genealogy changes in response to both the mutation rate and the selection coefficient. If the priors
are too wide, it might be difficult to estimate both parameters together. To test this possibility,
data were simulated for θ = 0.01, σ = 10 and all selected alleles (A2) in the sample.
Figure 4.6 (a) shows the histogram of the posterior for σ when the priors were θ = 0.0001 to 0.1
and σ = 0 to 50. The posterior has a high variance and a mode not close to the true value.
The same data were analyzed again, this time estimating only σ, assuming that θ was known.
This was done by fixing θ = 0.01 and using the prior σ = 0 to 50. Figure 4.6 (b) shows the histogram
of the posterior for σ. It can be seen that the posterior estimate of σ changes drastically when the
prior for θ is restricted, and is now closer to the true σ = 10 of the simulated dataset.
Figure 4.5: Comparing run lengths - The dotted line shows the true value of σ. [Figure not reproduced: x-axis Number of sampled priors, 0 to 12 x 10^4; y-axis Sigma, 0 to 50.]
Figure 4.6: Histograms showing the change in the posterior of σ with different priors for θ: (a) priors θ = 0.0001 to 0.1 and σ = 0 to 50; (b) priors θ = 0.01 and σ = 0 to 50. The
dotted vertical lines show the true value of σ. [Figure not reproduced.]
CHAPTER 5
CONCLUSION
It is easy to simulate trees with directional selection using the ASG conditional on the sample
allele configuration. The trees show an increase or decrease in length to the MRCA depending
on the mutation rate, selection coefficient and allele frequencies at the tips. However the variance
associated with the total tree length increases substantially as a function of the selection coefficient.
Slade [30] reached the same conclusions about increase in variance of the time to the MRCA with
an increase in either mutation rate or selection coefficient. Similar levels also applied to diploid
selection. Selection also expands the state space of the graph [30]. This high variation will affect
the precision with which the values of selection coefficients can be estimated.
Mutation rate and selection coefficient together change the shape of the genealogies. A low level
of variation in a data set can be attributed to a high mutation rate with a high selection coefficient
or to a lower mutation rate with no selection. Thus, with no background information on the value
of θ (i.e., with a wide prior), it might be difficult to estimate both θ and σ concurrently. Having
specific prior information improves the estimates of σ substantially.
Summary statistics capture important information about the variation in data and the underlying genealogy. Multiple statistics can capture more information about the data in certain cases. The
two statistics S, the number of segregating sites, and π, the average number of pairwise differences
capture different information of the underlying genealogy. When the shape of the genealogy is very
different from that of neutral trees, as seen in the case of a mixed sample of two alleles with low
mutation rate, a combination of the two statistics works better than S alone. This is not seen when
the sample has only one allele type.
If we have an efficient simulator for our model of choice and summary statistics that capture
enough information about the data, then efficient applications that are easy to implement can be
built using Approximate Bayesian methods.
5.1 Future Work
5.1.1 Performance testing of the estimator
The efficiency and performance of this estimator need to be tested against multiple datasets with
different values of θ, σ, sample allele frequencies, sequence lengths and various combinations of
priors. The performance of the Approximate Bayesian approach needs to be compared to an MCMC
based method to see if the latter is more robust to the choice of priors.
5.1.2 Improving performance of ABC
The performance of the estimator can be further improved and the analysis made insensitive to the
tolerance levels of comparing the summary statistics, by using weighting and linear regression as
proposed by Beaumont et al. [16].
5.1.3 Extending the ASG
The effect of directional selection on linked neutral loci has maximum impact only when recombination is very low. If recombination is high or a long time has elapsed since fixation, it might
not be possible to see any signs of the selective event in the regions that surround the site under
selection. Introducing recombination will make the model more realistic, especially
if we want to get more information from longer sequences. Extensions of this simple haploid, two
allele Ancestral Selection Graph have been proposed by Neuhauser and Krone [21] and Slade [4].
Extending the model to K-allele and diploid models will make it more widely applicable.
5.1.4 Testing with real data
Testing with real data will indicate whether this estimator and simulator work for realistic values of
mutation rates, population sizes and selection coefficients.
REFERENCES
[1] R. R. Hudson and N. L. Kaplan. The coalescent process and background selection. Philosophical Transactions: Biological Sciences, 349(1327):19–23, 1995. (document), 1.3
[2] JFC Kingman. The coalescent. Stochastic Processes and Their Applications, 13:235–248, 1982.
(document), 1.1
[3] Stephen M. Krone and Claudia Neuhauser. Ancestral processes with selection. Theoretical
Population Biology, 51(3):210–237, June 1997. (document), 2.2, 2.2, 2.2
[4] Paul F. Slade. Simulation of selected genealogies. Theoretical Population Biology, 57(1):35–49,
2000. (document), 2.3, 2.3, 5.1.3
[5] S. Tavaré, D. J. Balding, R. C. Griffiths, and P. Donnelly. Inferring coalescence times from
DNA sequence data. Genetics, 145(2):505–518, February 1997. (document), 3.2, 3.2.2
[6] JFC Kingman. On the genealogy of large populations. In J Gani and EJ Hannan, editors,
Essays in Statistical Science, pages 27–43. Applied Probability Trust, London, 1982. 1.1
[7] Magnus Nordborg. Coalescent Theory. Handbook of Statistical Genetics, 2000. 1.1
[8] Yuseob Kim and Wolfgang Stephan. Detecting a local signature of genetic hitchhiking along
a recombining chromosome. Genetics, 160(2):765–777, 2002. 1.2
[9] B. Charlesworth, M. T. Morgan, and D. Charlesworth. The effect of deleterious mutations on
neutral molecular variation. Genetics, 134(4):1289–1303, 1993. 1.2
[10] G. A. Watterson. On the number of segregating sites in genetical models without recombination. Theoretical Population Biology, 7(2):256–276, 1975. 1.4
[11] Fumio Tajima. Evolutionary relationship of DNA sequences in finite populations. Genetics,
105(2):437–460, 1983. 1.4
[12] F. Tajima. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics, 123(3):585–595, November 1989. 1.4
[13] John H. McDonald and Martin Kreitman. Adaptive protein evolution at the Adh locus in
Drosophila. Nature, 351(6328):652–654, June 1991. 1.4
[14] Brian C Verrelli, John H McDonald, George Argyropoulos, Giovanni Destro-Bisol, Alain
Froment, Anthi Drousiotou, Gerard Lefranc, Ahmed N Helal, Jacques Loiselet, and Sarah A
Tishkoff. Evidence for balancing selection from nucleotide sequence analyses of human G6PD.
Am J Hum Genet, 71(5):1112–28, 2002. 1.4
[15] Molly Przeworski. Estimating the time since the fixation of a beneficial allele. Genetics,
164(4):1667–1676, 2003. 1.4, 3.2.2
[16] Mark A. Beaumont, Wenyang Zhang, and David J. Balding. Approximate Bayesian computation in population genetics. Genetics, 162(4):2025–2035, December 2002. 1.4, 3.1, 3.2, 3.2.3,
3.2.4, 5.1.2
[17] N. L. Kaplan, T. Darden, and R. R. Hudson. The coalescent process in models with selection.
Genetics, 120(3):819–829, 1988. 2.1
[18] N. L. Kaplan, R. R. Hudson, and C. H. Langley. The “hitchhiking effect” revisited. Genetics,
123(4):887–899, 1989. 2.1
[19] M Slatkin. Simulating genealogies of selected alleles in a population of variable size. Genet
Res, 78(1):49–57, 2001. 2.1
[20] R. R. Hudson and N. L. Kaplan. The coalescent process in models with selection and
recombination. Genetics, 120:831–840, 1988. 2.1
[21] C. Neuhauser and S. M. Krone. The genealogy of samples in models with selection. Genetics,
145(2):519–534, 1997. 2.2, 2.2, 2.3, 5.1.3
[22] John Wakeley. The limits of theoretical population genetics. Genetics, 169(1):1–7, January
2005. 2.2
[23] Paul F. Slade. Most recent common ancestor probability distributions in gene genealogies
under selection. Theoretical Population Biology, 58(4):291–305, 2000. 2.3
[24] Nicoleen Cloete, Geoff K. Nicholls, and David J. Scott. Simulation of ancestral selection
graphs for Monte Carlo integration. In Jagoda Crawford and A. J. Roberts, editors, Proc. of
11th Computational Techniques and Applications Conference CTAC-2003, volume 45, pages
C391–C404, June 2004. 3
[25] J. S. Shoemaker, I. S. Painter, and B. S. Weir. Bayesian statistics in genetics: a guide for the
uninitiated. Trends in Genetics, 15:354–358, 1999. 3.1
[26] David A. Tallmon, Gordon Luikart, and Mark A. Beaumont. Comparative evaluation of a new
effective population size estimator based on approximate Bayesian computation. Genetics,
167(2):977–988, June 2004. 3.1, 3.2.3, 3.2.4
[27] Paul Marjoram and Simon Tavaré. Modern computational approaches for analysing molecular
genetic variation data. Nat Rev Genet, 7(10):759–770, 2006. 3.1, 3.2.1, 3.2.3
[28] YX Fu and WH Li. Estimating the age of the common ancestor of a sample of DNA sequences.
Mol Biol Evol, 14(2):195–199, February 1997. 3.2.1
[29] Paul Marjoram, John Molitor, Vincent Plagnol, and Simon Tavaré. Markov chain Monte Carlo
without likelihoods. Proceedings of the National Academy of Sciences of the United States of
America, 100(26):15324–15328, 2003. 3.2.3
[30] Paul F. Slade. Simulation of 'hitch-hiking' genealogies. Journal of Mathematical Biology,
42(1):41–70, 2001. 5
BIOGRAPHICAL SKETCH
Sonali Joshi was born in India on December 26, 1976. She graduated from high school in 1994 and
went on to graduate with a B.Eng. in Electronics and Telecommunications Engineering from Pune
University, India in 1998. After graduation she worked for four years in Mumbai, India writing
software for manufacturing and financial companies.
She returned to graduate school in 2004 at Florida State University to study Computational
Biology and Population Genetics. She graduates with an M.S. in Biological Science in Summer 2007.