
Florida State University Libraries
Electronic Theses, Treatises and Dissertations
The Graduate School
2007

Estimating Selection Coefficient Using the Ancestral Selection Graph
Sonali Joshi

THE FLORIDA STATE UNIVERSITY
COLLEGE OF ARTS AND SCIENCES

ESTIMATING SELECTION COEFFICIENT USING THE ANCESTRAL SELECTION GRAPH

By SONALI JOSHI

A Thesis submitted to the Department of Biological Science in partial fulfillment of the requirements for the degree of Master of Science

Degree Awarded: Summer Semester, 2007

The members of the Committee approve the Thesis of Sonali Joshi defended on June 1, 2007.

Peter Beerli, Professor Directing Thesis
Gavin Naylor, Committee Member
David Swofford, Committee Member

The Office of Graduate Studies has verified and approved the above named committee members.

ACKNOWLEDGEMENTS

I am extremely thankful to my major professor Dr. Peter Beerli for giving me an opportunity to work with him. His support and patience were invaluable not only in my research but also in coping with other areas of graduate school. He was always enthusiastic about this project, which gave me great motivation.

Koffi Sampson took time out of his research to help me understand the intricate mathematics of the Ancestral Selection Graph. I would also like to thank him for patiently pointing me in the right direction during my research.

I would like to thank the members of my supervisory committee, Dr. Dave Swofford and Dr. Gavin Naylor, for their insightful comments during the course of this research work. Their encouragement provided a tremendous boost for my confidence.

Finally, I would like to thank my family for their support, and especially my husband, for without his help I would not have been able to accomplish this.

Part of this research was supported by the joint NSF/NIGMS Mathematical Biology program with NIH grant R01 GM 078985.
TABLE OF CONTENTS

List of Tables
List of Figures
Abstract
1. INTRODUCTION
   1.1 The coalescent
   1.2 Genealogies and selection
   1.3 Summary Statistics
   1.4 Detecting and measuring selection
2. MODELS
   2.1 Deterministic models
   2.2 Ancestral Selection Graph
   2.3 Trees conditional on sample allele configuration
3. METHODS
   3.1 Bayesian Methods
   3.2 Simulation based Rejection methods
4. RESULTS
   4.1 Implementation
   4.2 Results
   4.3 Testing the Estimator with Simulated Data
5. CONCLUSION
   5.1 Future Work
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2.1 Rules to resolve branching going up the Ancestral Selection Graph
2.2 Comparing the time to the Ultimate Ancestor (T_UA) of the Ancestral Selection Graph and the corresponding time to the Most Recent Common Ancestor (T_MRCA) of the extracted genealogy
2.3 Branching events looking backward in time
4.1 Change in Standard Deviations in tree lengths with Selection Coefficient

LIST OF FIGURES

1.1 A realization of the coalescent genealogy for a sample of size 4
1.2 Variation in genealogies with the Coalescent. Figure courtesy of P. Beerli.
1.3 Sample genealogies - from Hudson & Kaplan [1] (a) A selective sweep (b) A typical neutral genealogy (c) A balanced polymorphism
2.1 Example of an Ancestral Selection Graph, showing branching and coalescing events. The graph starts with a sample of size 3 and is constructed until the time of the Ultimate Ancestor (UA)
2.2 Example of an Ancestral Selection Graph and the corresponding extracted genealogy.
2.3 Graph showing the increase in the average number of branching events with selection coefficient
2.4 Example of an Ancestral Selection Graph for a sample with known allele configuration and two possible genealogies. Only the real branches are a part of the final genealogy.
4.1 Graph showing the average number of selected alleles (A2) that arise at the tips of the ASG, in a simulated sample of size 10. The two curves show the dependence of the alleles at the tips on the allele selected at the UA. The red curve shows the average number of A2 alleles when the UA allele is A2 and the blue curve is for the UA allele A1.
4.2 Change in tree lengths with strength of selection
4.3 Change in mean tree length with σ and sample allele frequencies. Colors show the allele frequencies of the favored allele (A2) at the tips, blue = 0, green = 0.5 and red = 1
4.4 Histograms showing the posteriors of θ using different summary statistics. The dotted vertical lines show the true value of θ.
4.5 Comparing run lengths - The dotted line shows the true value of σ
4.6 Histograms showing the change in the posterior of σ with different priors for θ. The dotted vertical lines show the true value of σ.

ABSTRACT

Detecting and measuring selection is of fundamental importance in many population genetic studies. The objective of this study is to estimate the strength of selection acting at a locus by studying polymorphism in neutral regions of DNA that are tightly linked to the given locus. Kingman's [2] Coalescent theory gives us a model to reconstruct the ancestral history of a sample for neutrally evolving sites. The Ancestral Selection Graph (ASG), introduced by Krone and Neuhauser [3], is an extension of the neutral coalescent process that incorporates selection. However, its use has been limited due to computational issues in simulating the graph.
In this study I use the Ancestral Selection Graph conditional on the sample allele configuration, as described by Slade [4], for efficient simulation of selected genealogies. Inferences in the coalescent framework are often based on Likelihood or Bayesian approaches. As population genetic models get more complex there is a growing interest in simulation based Approximate Bayesian Methods. Tavaré et al. [5] first described an Approximate Bayesian Computation (ABC) method for simulating observations from posterior distributions without the use of Likelihoods. These methods use summary statistics to study the data and infer parameters. An estimator of the Selection Coefficient is built using the Ancestral Selection Graph to simulate selected genealogies and the Approximate Bayesian approach to estimate parameters. The effects of the choice of summary statistics, priors and run lengths on the estimation of parameters are explored using simulated data.

CHAPTER 1
INTRODUCTION

The aim of population genetics is to study within-species variation in genetic data and to infer parameters of the processes that could cause the variation. The parameters of interest include the rates of mutation, migration, recombination and population growth. Earlier methods based on summaries of data have given way to computationally oriented statistical methods. This change is driven by the increase in computational power and the large quantities of data currently available. Statistical analysis is usually based on finding a model that best explains the variation in the data and then estimating parameters of the model. Methods based on Maximum Likelihood and Bayesian inference are widely used. In this study I focus my attention on the genetic variation that is due to selection and develop a method to measure the strength of selection.
The current chapter introduces the coalescent, a model to simulate neutral genealogies, the changes in genetic variation expected due to selection, and some summary statistics of data used to detect this difference.

Natural selection acts on individuals based on their phenotype or fitness but manifests itself at the population level as a change in allele frequencies through time. This is fundamental to the study of evolution. Detecting and estimating selection is a central problem in population genetics and has been addressed at many levels. Parameters of interest with respect to selection are (1) the time of the origin of the selected mutation (in case of a selective sweep), (2) the strength of selection and (3) signatures of selection along a chromosome. Molecular data give us a look at a finer level of population dynamics, and estimating the strength of selection from DNA data is an active area of research.

1.1 The coalescent

The neutral theory states that in a single population with no gene flow, most genetic variation is due to neutrally evolving sites and only mutation and genetic drift shape the variation seen. Though other evolutionary forces like recombination, migration and selection do shape genetic variation, the neutral theory serves as a null model for the expected genetic variation in a given sample.

Most modern statistical population genetics is based on the Coalescent theory described by Kingman [2, 6]. The main assumption of the theory is that the sites under consideration are evolving neutrally in a single population, with only genetic drift and mutation leading to the current polymorphisms. It describes the genealogical history of a sample of k genes in a population of constant size N. The genealogy is reconstructed backward in time until the most recent common ancestor of the sample is found; the time to this ancestor is denoted T_MRCA. While k lineages remain, the waiting time to the next coalescent event is exponentially distributed with mean 2/(k(k − 1)).
Time is often measured in units of 2N generations. The sample size reduces by one when two random individuals are chosen to coalesce. In the limit when N → ∞, the probability that more than two alleles coalesce at the same time is negligible. Figure 1.1 shows a coalescent genealogy for a sample of k = 4.

Figure 1.1: A realization of the coalescent genealogy for a sample of size 4 [tree diagram; the coalescent intervals t1, t2, t3 run toward the past and end at the TMRCA]

The coalescent is based on two insights [7]. First, the coalescent can separate the mutation process from the genealogy process, because the neutral sites do not affect reproductive success. Hence we can first create the genealogy as described above and then superimpose mutations on it. Second, the neutrality assumption also allows us to create an ancestry of only a sample of genes without worrying about the whole population. In short, genetic drift is modeled by the genealogy, and neutral mutations are modeled by first creating the genealogy and then adding mutations to the tree. Refer to Nordborg [7] for an in-depth review of the coalescent.

The coalescent is thus an efficient tool for simulating genealogies, generating data, and studying the expected variation in polymorphisms under neutral assumptions. It allows us to develop algorithms for computer simulations of population samples under various models, to study these distributions and to estimate likelihoods of the population parameters. Figure 1.2 shows the typical variation expected in genealogies, and hence in genetic data, for a given set of parameters. It shows that large variation is expected in neutral data due to chance alone.

Figure 1.2: Variation in genealogies with the Coalescent. Figure courtesy of P. Beerli. [plot; horizontal axis: time to MRCA in 10^3 generations, vertical axis: frequency in units of 10^-6]
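The tree-building step of Section 1.1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in this work: the function names are hypothetical, and time is assumed to be in the rescaled units in which the mean waiting time with j lineages is 2/(j(j − 1)). Only the coalescent intervals are simulated; the topology and the mutations are left out.

```python
import random


def simulate_coalescent_times(k, rng=None):
    """Return the coalescent waiting times [t_k, ..., t_2] for k sampled genes.

    While j lineages remain, the waiting time to the next coalescent event
    is exponential with rate j(j-1)/2, i.e. mean 2/(j(j-1)).
    """
    rng = rng or random.Random()
    times = []
    for j in range(k, 1, -1):
        rate = j * (j - 1) / 2.0
        times.append(rng.expovariate(rate))
    return times


def tmrca(times):
    """Time to the most recent common ancestor: the sum of all intervals."""
    return sum(times)


def total_tree_length(times, k):
    """Total branch length: j lineages are present during the interval t_j."""
    return sum(j * t for j, t in zip(range(k, 1, -1), times))


def expected_tmrca(k):
    """Sum of 2/(j(j-1)) for j = 2..k, which telescopes to 2(1 - 1/k)."""
    return 2.0 * (1.0 - 1.0 / k)
```

Averaging `tmrca` over many replicates for k = 4 should approach 2(1 − 1/4) = 1.5, while single replicates scatter widely around that value; this scatter is the kind of variation illustrated in Figure 1.2.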
As long as we can keep the genealogical and the mutational processes separate, the coalescent can be extended to incorporate other forces such as recombination, population growth, population subdivision and migration. The effects of selection can be included in the coalescent only if they are so strong that frequencies of different genetic backgrounds can be approximated as changing deterministically. In these cases the allele frequencies are assumed known through the ancestral generations and the coalescent is modeled for the allelic classes, assuming no selection within them. But as reproductive success depends on allele type when we consider selection, it becomes difficult to incorporate selection in the coalescent framework. Chapter 2 describes the Ancestral Selection Graph, an extension of the coalescent with selection that does not require knowledge of the allele frequencies in the ancestral generations. It modifies the coalescent tree building process to include branching events, thus forming a network or graph. The final genealogy is extracted from the graph following a set of simple rules that, in effect, give the selected allele an advantage.

1.2 Genealogies and selection

Even though a random sample of individuals from a population is taken for a study, they are ultimately related by a genealogy. Gene trees or genealogies represent the history of a sample of genes from a population. Patterns of variation in DNA are shaped by the genealogical history of the samples. Though we may never know the true gene trees, we can say something about the expected variation in a sample if we know the distribution of trees under various population models. Kingman's coalescent gives us a statistical model for gene trees under neutrality. This is essentially a null model, and changes in genealogy can be brought about by migration, bottlenecks, population growth, population substructure, selection and other forces.
Since we know the distribution of genealogies under the neutral assumptions using the coalescent, a departure from this expectation can be seen as evidence that some other force has shaped the variation. Selection can be detected similarly, as it is expected to change the genealogies. To compare against a neutrally evolving region, it is customary to look at neutral regions linked to selected sites to detect and estimate selection.

Differentiating between different causes of molecular variation

Since forces such as migration, recombination and selection all act simultaneously in any population, it is difficult to tease apart their separate effects, and any analysis depends much on our prior beliefs. However, while bottlenecks and migrations should affect the whole genome in the same way, selection acts on a specific part of the genome. Estimating selection is further complicated by the fact that weak selection might not be distinguishable from the action of drift. Allele frequency can also increase in a population by chance, rather than by selection. Even neutral nucleotide diversity has a large variance and hence might be difficult to distinguish from the effects of weak selection. Recombination may further reduce these effects in linked neutral variation. Kim and Stephan [8] describe a statistical method to test the significance of a local reduction in variation caused by a hitchhiking event, to differentiate it from a reduction due to drift.

Theory suggests that selection would change the genealogy and any linked neutral variation in a certain expected way. Figure 1.3 shows some changes that selection is expected to cause as compared to tree (b), which is a neutral genealogy.

• Directional selection or Selective sweep (a) - The tree shows a selective sweep event. A favorable allele originates in a population at the time where all branches emerge and sweeps through to fixation. It carries with it linked neutral sites (hitchhiking) and effectively shortens the genealogy.
• Background or Purifying selection - happens when deleterious mutations are purged by selection. This removes many linked mutations and again shortens the genealogy [9].
• Balancing selection (c) - as shown by the tree, maintains variation in a population and effectively increases the length of the gene trees.

Figure 1.3: Sample genealogies - from Hudson & Kaplan [1] (a) A selective sweep (b) A typical neutral genealogy (c) A balanced polymorphism

1.3 Summary Statistics

Statistics of the DNA data of a sample can capture any change in the tree length or topology. These can easily be compared to the neutral expectations, since we know the distributions of the gene trees under neutrality. The following examples of commonly used summary statistics refer to Figure 1.3.

• Number of Segregating Sites S - is a count of all polymorphic sites in a sample. Any change in the total length of the coalescent tree would be reflected in the number of variable sites in the data. For example, a locus hitchhiking with a selected site undergoing a selective sweep as in tree (a) will have a lower S compared to the neutral expectation.
• Average Pairwise Differences, Π - is calculated by taking all pairs of individuals in a sample and computing the average number of differences between them. Any change in branch lengths or topology would be reflected in the average pairwise distances. Tree (c) with balancing selection shows that two clusters of haplotypes are maintained in the population, increasing the value of Π.
• Site frequency spectrum - For n sample sequences, the site frequency spectrum is a random vector of size n − 1, whose i-th entry is the number of sites at which exactly i individuals in the sample carry the mutation. In other words, it counts the sites whose mutation is found in a single individual, in two individuals, and so on.
A starlike tree (a), for example, will have more singleton mutations. It reflects any change in topology of the gene tree.
• Number of distinct haplotypes H - is a count of distinct haplotypes and is a measure of Linkage Disequilibrium. A hitchhiking event could lead to strong Linkage Disequilibrium.

Summary statistics are an effective way to characterize a dataset. They can also be used to estimate certain population parameters accurately.

1.4 Detecting and measuring selection

The first estimators of population parameters were based on summaries of data. The most common among them is Watterson's [10] estimator of θ, which is based on the number of segregating sites in a dataset, corrected for the sample size. If k is the number of segregating sites and n is the sample size, then the estimate of θ is given by

$\theta = k / a$, where $a = 1 + \frac{1}{2} + \dots + \frac{1}{n-1}$.

Tajima's [11] estimator of θ is based on the average pairwise distances. If $\Pi_n$ is the average number of pairwise distances in a sample of size n, then

$\theta = \Pi_n$

Historically the focus has been on detecting the presence or signature of selection from molecular data rather than measuring the strength of selection. Various tests that measure any departure from the estimated values under neutrality were developed to detect the presence or absence of selection.

Tajima's D

Tajima's D [12] makes use of the above mentioned ways of estimating θ to come up with a selection test. The assumption here is that in the neutral case the two estimates will be the same. Since k ignores the frequency of mutations, it is affected by the occurrence of rare alleles, whereas the calculation of $\Pi_n$ takes the frequency into account. Tajima's D is roughly defined as the difference between the two estimates, scaled appropriately:

$D = \theta(\Pi_n) - \theta(k)$

A selective sweep reduces variation and k recovers faster than Π, giving a negative value of D.
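As a small worked illustration of the two estimators, the sketch below computes them for a toy alignment. The function names and the four example sequences are made up for illustration, the sequences are assumed to be aligned, of equal length and free of gaps, and only the unscaled difference between the estimates is computed; the scaling used in Tajima's actual statistic [12] is not reproduced here.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Count alignment columns that contain more than one distinct base (k)."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of sequences (Pi_n)."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


def harmonic_correction(n):
    """a = 1 + 1/2 + ... + 1/(n-1), the sample-size correction."""
    return sum(1.0 / i for i in range(1, n))


def watterson_theta(seqs):
    """theta = k / a, based on the number of segregating sites."""
    return segregating_sites(seqs) / harmonic_correction(len(seqs))


def tajima_theta(seqs):
    """theta = Pi_n, based on the average pairwise differences."""
    return mean_pairwise_differences(seqs)


def unscaled_difference(seqs):
    """theta(Pi_n) - theta(k), without the scaling used in Tajima's D."""
    return tajima_theta(seqs) - watterson_theta(seqs)


# Hypothetical toy sample of n = 4 sequences with 4 sites each.
sample = ["AAGT", "AAGT", "ACGT", "ACGA"]
```

For this toy sample there are k = 2 segregating sites and a = 11/6, so the first estimate is 12/11, while the average pairwise difference is 7/6. The two estimates nearly agree, and their difference is 5/66.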
Balancing selection maintains variation by forming clusters of haplotypes and increases Π, leading to a positive D.

McDonald-Kreitman test

Because synonymous variation is not under selection, selection can be detected from the difference between the synonymous and nonsynonymous variation in a protein coding region. McDonald and Kreitman [13] proposed a statistical test based on this observation, comparing the number of fixed sites to the number of polymorphic sites. The ratio of amino acid replacement to silent polymorphisms within a population is compared to the ratio of replacement to silent fixed differences between the population and an outgroup. Under neutrality the ratios are expected to be equal. Any departure is evidence for selection.

Allele Frequencies

Variation in allele frequencies among populations can be an effect of natural selection. All neutral polymorphisms that are due to mutation and drift are expected to have similar variation in geographically isolated populations. F_ST based statistics are used to detect this variation.

Linkage Disequilibrium

Directional selection may cause linkage disequilibrium (LD), as a selected allele that sweeps to fixation takes along with it other linked sites, thereby reducing variation. The reduction in variation in the linked regions depends on the strength of selection, recombination and the time elapsed since the selection event. Strong LD may be associated with alleles if selection has recently increased their frequencies.

While the presence of selection can be shown in many cases, the strength of selection cannot be measured by these methods. For example, Verrelli et al. [14] use various summaries of data to find evidence of balancing selection in human G6PD data. The aim of this work is to build an estimator to measure the strength of selection. Computational methods based on summaries of data are gaining popularity due to their ease of implementation.
Przeworski [15] developed a summary-based rejection method to estimate the time since fixation of a beneficial allele. Briefly, the goals of this work are:

1. To study the Ancestral Selection Graph, a coalescent based model with selection, to get the expected variation in genealogies due to selection.
2. To study inference using simulation and summary statistic based Approximate Bayesian methods.
3. To build a selection strength estimator based on the Ancestral Selection Graph using the technique of Approximate Bayesian Computation [16].

The following chapters describe the model and methods in further detail.

CHAPTER 2
MODELS

Coalescent theory gives us an elegant way to simulate genealogies for a sample of genes. Models that incorporate selection into the coalescent framework fall into two categories based on the way they treat the change in allele frequencies through generations: deterministic and stochastic models. Deterministic methods assume that selection is strong and that allele frequencies through the ancestral generations are known. The following sections discuss these models in further detail.

2.1 Deterministic models

This approach assumes that strong selection overcomes the effects of drift and that allele frequencies are expected to change deterministically. These methods first calculate the allele frequencies through the ancestral generations deterministically. Consider selection acting at a locus with two alleles A1 and A2. The populations of each allele type through ancestral generations can be thought of as an 'allelic class'. Within each 'allelic class' there is no selection and hence we can use the standard coalescent to model genealogies for each class. The effect of balancing selection by this method reduces to a two population problem with migration taking the place of mutation between classes [17].
A selective sweep can be thought of as increasing the frequency of a selected allele deterministically and can be modeled as population growth [18]. Slatkin [19] described an importance sampling method to simulate the history of a selected allele by simulating backwards from the current frequency until the allele is lost. Hudson and Kaplan [20] studied the effects of overdominant selection, assuming that the frequencies of the two alleles are fixed. Here, the problem reduces to a two-island model, with mutation and recombination playing the role of migration.

These methods clearly perform poorly when selection is weak or of comparable strength to drift. Though the mathematical expressions for the branch length changes in a sample of size two have been derived, it is difficult to generate the entire genealogy efficiently.

2.2 Ancestral Selection Graph

The stochastic method is the Ancestral Selection Graph proposed by Krone and Neuhauser [3]. The Ancestral Selection Graph (ASG) is an extension of the neutral coalescent process that incorporates selection. This was the first attempt to simulate selected genealogies from the coalescent without assuming that allele frequencies are known through the ancestral generations. The main problem in incorporating selection is that we lose the simplicity of the coalescent assumption of neutrally evolving sites, and we cannot separate the mutation process from the genealogy building process. To overcome this problem, a network or graph called the Ancestral Selection Graph is constructed for the sample of size k, with the true genealogy embedded in it. In addition to coalescent events the graph contains branching events, which allow genes to take alternative paths as the graph evolves. This separates the mutation process from the graph building process. The final genealogy is then extracted from the graph based on certain rules which give the favored allele a selective advantage. The following sections describe the model.
Figure 2.1: Example of an Ancestral Selection Graph, showing branching and coalescing events. The graph starts with a sample of size 3 and is constructed until the time of the Ultimate Ancestor (UA)

This model is described in detail in Krone and Neuhauser [3] and Neuhauser and Krone [21]. The model, as described here, assumes a haploid population evolving according to the Moran model. A single locus with two alleles, A1 and A2, is under selection, with A2 having a selective advantage s. The population size N is large and constant. Mutation is symmetric between the alleles with rate μ_N.

Population model

The model used to describe the theory is the continuous time Moran model with selection and mutation. However, as the results are obtained using a diffusion approximation, it applies equally to the discrete time Wright-Fisher model in the limit N → ∞. The Moran model is described as a continuous time Markov process in which a random individual is chosen to reproduce at a given rate and the offspring replaces an individual, thereby keeping the population size constant. Thus, this is essentially a birth-death process which can be analyzed using Markov chain theory.

Diffusion approximation

The above Markov process is studied as a diffusion process by rescaling the parameters. The diffusion theory holds in the limit as the population size tends to infinity. Time is rescaled and measured in units of N generations. The other parameters are rescaled as

$N s_N \to \sigma$ and $N \mu_N \to \theta$ as $N \to \infty$,

where σ and θ are positive and finite. The unique stationary distribution of this process has density f(x), which is a special case of Wright's formula:

$f(x) = K x^{\theta - 1} (1 - x)^{\theta - 1} e^{-\sigma x}$,

where x is the frequency of either allele, and K is a normalizing constant. The model thus applies to large populations where the mutation rate and the selection coefficient are small.
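To make the role of the normalizing constant K concrete, the sketch below evaluates the density numerically. It is an illustration under stated assumptions rather than part of the method: the function names are hypothetical, the integral is approximated with a plain midpoint rule (which is adequate for θ ≥ 1 but converges slowly for θ < 1, where the density is unbounded at the endpoints), and no assumption is made about which allele x refers to beyond the formula as written.

```python
import math


def unnormalized_density(x, theta, sigma):
    """x^(theta-1) * (1-x)^(theta-1) * exp(-sigma*x), i.e. f(x) without K."""
    return x ** (theta - 1.0) * (1.0 - x) ** (theta - 1.0) * math.exp(-sigma * x)


def integrate_midpoint(func, n_steps=10000):
    """Midpoint-rule integral of func over (0, 1); the endpoints are never evaluated."""
    h = 1.0 / n_steps
    return h * sum(func((i + 0.5) * h) for i in range(n_steps))


def normalizing_constant(theta, sigma, n_steps=10000):
    """K chosen so that f integrates to one over (0, 1)."""
    mass = integrate_midpoint(lambda x: unnormalized_density(x, theta, sigma), n_steps)
    return 1.0 / mass


def mean_frequency(theta, sigma, n_steps=10000):
    """Mean of x under the stationary density f."""
    k_const = normalizing_constant(theta, sigma, n_steps)
    first_moment = integrate_midpoint(
        lambda x: x * unnormalized_density(x, theta, sigma), n_steps
    )
    return k_const * first_moment
```

With σ = 0 the density is symmetric about 1/2, so `mean_frequency(2.0, 0.0)` returns 0.5 and `normalizing_constant(2.0, 0.0)` is close to 6. With σ > 0 the factor e^{−σx} shifts the mass toward small x, so the mean drops below 0.5.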
The standard diffusion holds for any σ and θ, as long as these remain finite as N tends to infinity. An excellent review of the limitations of these assumptions is given by Wakeley [22].

Constructing the genealogy using the Ancestral Selection Graph

The behavior of the Ancestral Selection Graph (ASG) evolving according to the Moran model is derived using the biased voter model. For a full description of this model refer to [3]. The ASG construction involves branching events along with the coalescent events going backward in time. A coalescent event reduces the sample size by one, while a branching event increases the sample size by one in the graph. The branching event does not add an ancestor to the real genealogy, but just serves to give an alternative path the genealogy could take so as to give a selective advantage to the favored allele. This advantage is achieved by the way branching is resolved while extracting the true genealogy from the graph.

Figure 2.2: Example of an Ancestral Selection Graph and the corresponding extracted genealogy.

The basic steps in getting the genealogy can then be summarized as:

• Constructing the graph - If the sample size is k at any given time, the rates at which coalescing or branching events happen are defined by

    Coalescing: k → k − 1 at rate k(k − 1)/2
    Branching: k → k + 1 at rate kσ/2

The two branches emerging out of a branching node are labeled as incoming and continuing respectively. These rates are scaled by N to get branch lengths in units of generations. Starting from the current sample of k alleles, this process is run backwards in time until the sample size reaches 1 for the first time. This ancestor is referred to as the Ultimate Ancestor (UA), which is not necessarily the most recent common ancestor (MRCA) of the samples.
• Adding mutations - the branching makes the mutation process independent of constructing the graph.
Starting from the UA we add mutations according to a Poisson process with rate θ/2 independently along each branch of the graph. The type of the UA is then chosen according to the stationary distribution of the process and is evolved up to the tips depending on the location of the mutations.
• Extracting the true genealogy - We then travel up the graph to extract the genealogy. At each branching node we face a choice of which branch to keep in the real genealogy. This is where the selective advantage of the allele matters. If both alleles coming towards the branching point are the same then we can choose to keep either branch with equal probability, but if the alleles are different then the branch containing the selected allele A2 is always kept in the final genealogy. This leads to an increase in the favored allele. Table 2.1 shows the rules to resolve branching.

Table 2.1: Rules to resolve branching going up the Ancestral Selection Graph

    Branch 1    Branch 2    Keep Branch
    A1          A1          Either 1 or 2
    A1          A2          Branch 2
    A2          A1          Branch 1
    A2          A2          Either 1 or 2

It is easy to see that if σ = 0 then there will be no branching and the method reduces to a simple coalescent. The branching rate depends on the strength of selection, while the coalescent rate depends only on the number of ancestors present in the sample at any given time. Notice that as σ increases, the rate at which the events happen also increases, which means that the number of branching events increases. This does not necessarily mean that the total tree length will increase, as the branch lengths decrease correspondingly.

Extending the Ancestral Selection Graph

Neuhauser and Krone [21] describe a number of models to which the ASG can be extended, amongst which are the K-allele model, the infinitely many alleles model, the infinitely many sites model and diploid models. These models give more complicated graph structures than the haploid model discussed here.
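For the haploid two-allele model, the event process and the branching rules of Table 2.1 can be sketched as follows. This is a simplified illustration with hypothetical function names, not the implementation used in this work: it only tracks the number of lineages, the elapsed time and the event counts, using the rates k(k − 1)/2 and kσ/2 in rescaled time, and it does not store the graph, place mutations or extract the embedded genealogy. The `max_events` guard is an added safety limit, since the number of events grows quickly with σ.

```python
import random


def simulate_asg_events(k, sigma, rng=None, max_events=1_000_000):
    """Run the ASG lineage-count process from k lineages until one remains.

    Returns (time_to_ua, n_branching, n_coalescent).
    """
    rng = rng or random.Random()
    lineages = k
    time_to_ua = 0.0
    n_branching = 0
    n_coalescent = 0
    while lineages > 1:
        if n_branching + n_coalescent >= max_events:
            raise RuntimeError("event limit reached before the UA was found")
        coal_rate = lineages * (lineages - 1) / 2.0
        branch_rate = lineages * sigma / 2.0
        total_rate = coal_rate + branch_rate
        # Waiting time to the next event of either kind.
        time_to_ua += rng.expovariate(total_rate)
        # Choose the event type in proportion to its rate.
        if rng.random() < coal_rate / total_rate:
            lineages -= 1
            n_coalescent += 1
        else:
            lineages += 1
            n_branching += 1
    return time_to_ua, n_branching, n_coalescent


def resolve_branching(branch1, branch2, rng):
    """Return 1 or 2, the branch to keep, following Table 2.1.

    Equal types: either branch with equal probability.
    Different types: the branch carrying the favored allele A2.
    """
    if branch1 == branch2:
        return rng.choice((1, 2))
    return 1 if branch1 == "A2" else 2
```

With σ = 0 the branching rate is zero and the function performs exactly k − 1 coalescent events, which is the reduction to the simple coalescent noted above. For σ > 0 the number of coalescent events always exceeds the number of branching events by k − 1. Both functions are specific to the haploid two-allele case and would have to change for the other models.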
For example, the rules for resolving branchings change for K-allele models, while for the diploid model a branching event leads to three branches instead of two.

Drawbacks of the Ancestral Selection Graph

One apparent drawback of this method is that, as the number of branches increases with σ, the graph becomes very large. As the entire graph is required for the embedded genealogy to be extracted, it needs to be stored in computer memory. Even for low σ values the number of branches gets too large for the computer memory. This limits the use of the ASG to very low σ values, typically σ ≤ 12. Figure 2.3 shows how the average number of branching events increases with the strength of selection.

Figure 2.3: Graph showing the increase in the average number of branching events with selection coefficient (x-axis: sigma; y-axis: number of branching events).

Secondly, as the strength of selection increases the branching rate also increases, and hence it takes a very long time for the sample size to reach one. We can see from table 2.2 how the time to the UA increases with the strength of selection, though the time to the MRCA decreases or remains the same. This shows that the real gene tree coalesces long before the UA is reached, and a lot of time is wasted in reaching the UA, which is just an artifact of the ASG.

Finally, we have no control over the distribution of allele frequencies at the tips of the final genealogy. The following section describes attempts to overcome these problems and to get a more efficient algorithm to generate trees using the ASG.

2.3 Trees conditional on sample allele configuration

Typically in simulating selected genealogies using the ASG we have no control over the frequencies of the alleles that arise at the tips, as these depend on the assumed ancestral allele at the UA and the strength of selection.
But often, in the population or sample under study we know the allele frequencies. We might have to reject a lot of trees if we wish to generate genealogies that explain the sample at hand. Hence it might be more useful to draw trees conditional on the known sample configuration.

Table 2.2: Comparing the time to the Ultimate Ancestor (T_UA) of the Ancestral Selection Graph and the corresponding time to the Most Recent Common Ancestor (T_MRCA) of the extracted genealogy

σ     Mean T_MRCA    Mean T_UA
0     2.5495         2.5495
1     2.6678         2.9734
2     2.5929         3.1277
3     2.5304         3.3006
4     2.4271         3.6272
5     2.3466         4.5880
6     2.3109         7.2350
7     2.1818         13.3422
8     2.0845         36.0303
9     2.0596         58.4277
10    2.0645         227.4843

One way to look at the extra branches in an ASG is to label them as real and virtual [21]. The branches that will be a part of the final genealogy are called real and the extra branches produced at the branching events are called virtual. If we know the allele type at the tips we can assign a type to each branch going backward in time while constructing the ASG, without knowledge of the type of the UA.

Table 2.3: Branching events looking backward in time

Branching allele    Emerging alleles    Advantage mattered?
A1                  A1, A1              No
A2                  A1, A2              Yes
A2                  A2, A2              No

Figure 2.4: Example of an Ancestral Selection Graph with a sample of known sample allele configuration and two possible genealogies: (a) ASG, (b) possible genealogy 1, (c) possible genealogy 2. Only the real branches are a part of the final genealogy.

While table 2.1 gives rules to resolve branching going up the graph, table 2.3 states the same rules going down the graph as it is being constructed [4]. As an example refer to figure 2.4 showing an ASG of a sample of three A1 alleles, with no mutation on the graph. While constructing the graph backwards, it is straightforward to assign type real or virtual to each branch at a branching event
as we know exactly which allele emerges out of this node. In this case the allele is A1 and hence the labelling of the branches is arbitrary. We can see from the figure that the genealogy extracted from the ASG (a) can take one of two forms, either (b) or (c), with equal probability. Since selecting either branch is consistent with an A1 allele at the tip, it does not matter which branch is kept in the final genealogy. Here, as the decision of which branch to keep in the final genealogy is made at each branching point, the construction of the ASG can end when the MRCA is reached (the number of real ancestors reaches one).

Based on this fact, Slade [4] [23] proposed modifications to the ASG algorithm to draw a genealogy conditional on the given sample allele configuration. Starting from the known allele types in the sample, the ASG is constructed going backward in time as before, but the final genealogy is identified at the same time. This can be done only when the branches can be labelled as real or virtual at each event, i.e., the events themselves need to be classified as real or virtual. Three events are considered simultaneously: branching, coalescing and mutation events. Going backward in time the rates of the three events are given by

Coalescing : k → k − 1 at rate k(k − 1)/2
Branching : k → k + 1 at rate kσ/2
Mutation : at rate θ/2

In the algorithm a count of real and virtual ancestors is kept at all times. At a coalescent event only two alleles of the same type can coalesce, reducing the sample size by one. At a branching event, one extra ancestor of the same type is created. The number of real ancestors in a coalescent cannot increase, hence the extra branch created at the branching event is labelled as virtual and is not a part of the final genealogy. The other branch is then called real. At a mutation event the type of the allele changes from A1 to A2 or vice versa. Thus the number of real ancestors eventually reaches one, at which point the MRCA is obtained.
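The three competing events listed above can be read as a race between exponential clocks. The following is a minimal sketch of that race, not the thesis implementation: it tracks only the number of lineages k, ignores allele types and the real/virtual labels, and assumes that the mutation rate θ/2 applies per lineage (total rate kθ/2). The function names are hypothetical.

```python
import random

def next_event(k, theta, sigma, rng):
    """Draw the waiting time and the type of the next event for k >= 2 lineages."""
    rates = {
        "coalescence": k * (k - 1) / 2.0,
        "branching": k * sigma / 2.0,
        "mutation": k * theta / 2.0,  # assumption: theta/2 per lineage
    }
    total = sum(rates.values())
    dt = rng.expovariate(total)  # exponential waiting time with the total rate
    positive = [(event, rate) for event, rate in rates.items() if rate > 0]
    u = rng.random() * total
    for event, rate in positive:
        if u < rate:
            return dt, event
        u -= rate
    return dt, positive[-1][0]  # guard against floating-point round-off

def simulate_to_ultimate_ancestor(k, theta, sigma, seed=0):
    """Run the lineage count backward in time until it first reaches one."""
    rng = random.Random(seed)
    t = 0.0
    counts = {"coalescence": 0, "branching": 0, "mutation": 0}
    while k > 1:
        dt, event = next_event(k, theta, sigma, rng)
        t += dt
        counts[event] += 1
        if event == "coalescence":
            k -= 1
        elif event == "branching":
            k += 1
    return t, counts

t_ua, counts = simulate_to_ultimate_ancestor(10, theta=1.0, sigma=1.0, seed=1)
print(t_ua, counts)
```

Because every branching event must later be undone by an extra coalescence, the number of coalescences always exceeds the number of branchings by the starting sample size minus one.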
This method has the advantage of reducing the computer memory problems, as the entire graph need not be drawn out, but it might still take a long time to converge to a sample size of one depending on the mutation rate and selection coefficient.

To construct the ASG conditional on the sample allele configuration we need to keep track of the three events (branching, coalescing and mutation), the type of event (real or virtual) and the type of allele (A1 and A2). The probabilities for each event can then be derived [4]. In the following derivations, i and j refer to the allele types A1 and A2, r_i = number of real ancestors of type i, v_i = number of virtual ancestors of type i and n = (n_1, n_2) is the initial sample size. Similarly r = (r_1, r_2) and v = (v_1, v_2) are the total numbers of real and virtual ancestors.

Probability of coalescence

An event is a real coalescence event (a part of the final genealogy) only if both the ancestors involved are real.

Probability of real coalescence = P(Coalescence) * P(i or j) * P(two real)

\[ = \frac{n(n-1)}{n(n-1+\sigma+\theta)} \cdot \frac{n_i-1}{n-1} \cdot \frac{r_i(r_i-1)}{n_i(n_i-1)} = \frac{r_i(r_i-1)}{n_i(n-1+\theta+\sigma)} \]

An event is a virtual coalescence if either or both ancestors are virtual. This is not reflected in the genealogy, but just reduces the number of virtuals by 1.

Probability of virtual coalescence = P(Coalescence) * P(i or j) * P(two virtual or one of each)

\[ = \frac{n(n-1)}{n(n-1+\sigma+\theta)} \cdot \frac{n_i-1}{n-1} \cdot \frac{v_i(v_i-1+2r_i)}{n_i(n_i-1)} = \frac{v_i(v_i-1+2r_i)}{n_i(n-1+\theta+\sigma)} \]

Probability of mutation

Probability of real mutation = P(Mutation) * P(i or j) * P(real)

\[ = \frac{\theta}{n(n-1+\sigma+\theta)} \cdot \frac{n_i+1}{n} \cdot \frac{r_j}{n_j} \]

Probability of virtual mutation = P(Mutation) * P(i or j) * P(virtual)

\[ = \frac{\theta}{n(n-1+\sigma+\theta)} \cdot \frac{n_i+1}{n} \cdot \frac{v_j}{n_j} \]

Probability of branching

Both real and virtual ancestors can branch, but the new branch created is always virtual as the number of real ancestors cannot increase. Table 2.3 shows the two cases to be considered.
First, when both emerging alleles are of the same type, a virtual allele of either type can be gained.

Probability of branching (gain of virtual A_i) = P(Branching) * P(i or j)

\[ = \frac{\sigma}{n(n-1+\sigma+\theta)} \cdot \frac{n_i(n_i+1)}{n(n+1)} \]

Second, a gain of an A1 allele when an A2 allele branches.

Probability of branching (gain of virtual A_1) = P(Branching) * P(A_1)

\[ = \frac{\sigma}{n(n-1+\sigma+\theta)} \cdot \frac{2n_2(n_1+1)}{n(n+1)} \]

Based on these probabilities, the normalizing factor for Monte Carlo simulation of genealogies is given by equation 3.1 in [4].

\[
\begin{aligned}
h(r,v) ={}& \frac{1}{n-1+\theta+\sigma}\left[\sum_{i\in\{1,2\},\, r_i>0}(r_i-1) + \sum_{i\in\{1,2\},\, v_i>0}(v_i-1+2r_i)\right] \\
&+ \frac{\theta}{n-1+\theta+\sigma}\left[\sum_{i,j\in\{1,2\},\, r_j>0,\, i\neq j}\frac{r_i+1}{n} + \sum_{i,j\in\{1,2\},\, v_j>0,\, i\neq j}\frac{v_i+1}{n}\right] \\
&+ \frac{\sigma}{n-1+\theta+\sigma}\left[\sum_{i\in\{1,2\}}\frac{n_i(v_i+1)}{n(n+1)} + \frac{2n_2(v_1+1)}{n(n+1)}\right]
\end{aligned}
\]

The probabilities of branching, coalescing or mutation events are then given by

Real coalescent : r_i → r_i − 1 at rate
\[ \frac{r_i-1}{h(r,v)(n-1+\theta+\sigma)} \]

Virtual coalescent : v_i → v_i − 1 at rate
\[ \frac{v_i-1+2r_i}{h(r,v)(n-1+\theta+\sigma)} \]

Real mutation : r_i − 1, r_j + 1 at rate
\[ \frac{\theta(r_i+1)}{h(r,v)\,n\,(n-1+\theta+\sigma)} \]

Virtual mutation : v_i − 1, v_j + 1 at rate
\[ \frac{\theta(v_i+1)}{h(r,v)\,n\,(n-1+\theta+\sigma)} \]

Branching : v_i → v_i + 1 at rate
\[ \frac{\sigma\left(n_i(v_i+1) + \delta_{i,1}\,2n_2(v_1+1)\right)}{h(r,v)\,n(n+1)(n-1+\theta+\sigma)} \]

where δ_{i,1} is the Kronecker delta and i = 1, 2, i.e.,

\[ \delta_{ij} = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases} \]

The transition probabilities sum to one. The process is run backwards in time until the real sample size reaches one, which is the MRCA of the genealogy. While constructing the genealogy only the real branches and nodes are constructed as a structure in the computer memory. The branching events add one virtual ancestor, which only serves to increase the number of ancestors for the purpose of rate calculations and has no real significance in terms of the genealogy. Similarly, a coalescent event of virtual type just decreases the number of virtuals by one.
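The transition probabilities above can be evaluated numerically as a check that they sum to one. The sketch below is only an illustration under stated assumptions, not the thesis code: it assumes that n_i = r_i + v_i is the current number of ancestors of type i and n = n_1 + n_2, it indexes A1 as 0 and A2 as 1, and the function name and the event labels are hypothetical.

```python
def transition_probabilities(r, v, theta, sigma):
    """Return (probabilities, h) for the conditional ASG event formulas.

    r, v: pairs of real and virtual ancestor counts for (A1, A2).
    Assumption: n_i = r_i + v_i and n = n_1 + n_2 (current counts).
    """
    n_i = [r[0] + v[0], r[1] + v[1]]
    n = n_i[0] + n_i[1]
    denom = n - 1 + theta + sigma
    if n < 1 or denom <= 0:
        raise ValueError("need at least one ancestor and n - 1 + theta + sigma > 0")

    weights = {}
    for i in (0, 1):
        j = 1 - i  # the other allele type
        if r[i] > 0:
            weights[("real_coalescence", i)] = (r[i] - 1) / denom
        if v[i] > 0:
            weights[("virtual_coalescence", i)] = (v[i] - 1 + 2 * r[i]) / denom
        if r[j] > 0:
            weights[("real_mutation", i)] = theta * (r[i] + 1) / (n * denom)
        if v[j] > 0:
            weights[("virtual_mutation", i)] = theta * (v[i] + 1) / (n * denom)
        branch = n_i[i] * (v[i] + 1)
        if i == 0:
            # Kronecker delta term: an A2 allele branches and a virtual A1 is gained
            branch += 2 * n_i[1] * (v[0] + 1)
        weights[("branching", i)] = sigma * branch / (n * (n + 1) * denom)

    h = sum(weights.values())  # normalizing factor h(r, v)
    probabilities = {event: w / h for event, w in weights.items()}
    return probabilities, h

probs, h = transition_probabilities(r=(3, 2), v=(1, 0), theta=1.0, sigma=2.0)
for event, p in sorted(probs.items()):
    print(event, round(p, 4))
```

Dividing every term by h(r, v) is what makes the event probabilities sum to one, so a sampler only needs these normalized weights and the two count vectors, never the full graph.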
While constructing the graph, the probabilities of real or virtual coalescence are calculated only if the number of ancestors is 2 or more. This prevents the number of ancestors from going negative. With these modifications, it is not necessary to have all the branches and nodes in the computer memory, and it is also not necessary to continue drawing the graph until the UA. Thus it is possible to simulate genealogies for much higher σ values than was possible with the standard ASG.

CHAPTER 3

METHODS

In most model based statistical analysis, the choice of a method depends on how easily the model lends itself to calculating the likelihoods and how easy it is to traverse the parameter space. Typically, these models do not have analytical solutions and one needs to approximate the likelihood. Commonly used methods are Monte Carlo (MC) approximations, Importance Sampling (IS) and Markov Chain Monte Carlo (MCMC). In the case of the Ancestral Selection Graph, traversing the graph and tree space can be extremely difficult for MCMC based analysis, though there are some recent advances [24]. MC might generate a large number of unsuitable trees. It is much easier to generate trees using the Ancestral Selection Graph and simulate data on them than to change graph and tree topologies as MCMC methods require, which makes the ASG easy to use in an Approximate Bayesian method. Here we evaluate a selection estimator using the Ancestral Selection Graph as a simulator of selected genealogies. The following sections describe the motivation behind using the Approximate Bayesian Computation approach.

3.1 Bayesian Methods

Inferences in the coalescent framework are often based on Bayesian approaches. These methods give the posterior distribution of the parameter of interest, θ. The background information is included as a prior distribution π(θ).
The posterior distribution f(θ|D) for a dataset D is calculated as

f(θ|D) ∝ f(D|θ) π(θ)

These methods have the advantage of allowing background information to be incorporated in the model as priors; they also provide posterior probability distributions and integrate out nuisance parameters [25] [26]. Typically, these inferences use MCMC methods, which can potentially give exact solutions but can be time consuming and difficult to implement [16]. For a review of current trends in statistical methods refer to Marjoram and Tavaré [27].

3.2 Simulation based Rejection methods

As datasets get larger and population genetic models more complex, there is a growing interest in simpler simulation based approximate methods. Tavaré et al [5] first proposed a rejection sampling method to approximate the posterior distribution. The data is replaced by a summary statistic S. Then, instead of accepting a value of θ from the prior distribution with probability proportional to its likelihood P(D|θ), they proposed accepting it with probability proportional to the probability of obtaining the same S under the sampled value, P(S = s|θ) [16]. These methods are possible only when P(S = s|θ) is easy to calculate. Other variations on the same theme have been proposed for cases where such probabilities are difficult to calculate. One way is to generate a value from the prior, simulate new data using this value as the parameter of our model and accept the value if the simulated data equals the real data. The accepted values give the posterior distribution. No likelihood calculations are required. A rejection algorithm is the iteration of the following steps for a given set of data D.

• Step 1: Sample a value of the parameter θ from the prior.
• Step 2: Simulate data D′ using your model with the sampled parameter value of θ.
• Step 3: Accept the sampled value of θ if D = D′, else discard it.
• Step 4: Return to step 1.
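The four steps above can be sketched in a few lines. This is a toy illustration rather than the thesis code: the model (number of successes in ten Bernoulli trials), the uniform prior on the success probability, and all function names are assumptions chosen so that an exact match D = D′ occurs often enough to be practical.

```python
import random

def rejection_sample(observed, prior, simulate, n_accept, rng):
    """Keep drawing from the prior until n_accept values reproduce the data exactly."""
    accepted = []
    while len(accepted) < n_accept:
        value = prior(rng)                    # Step 1: sample from the prior
        if simulate(value, rng) == observed:  # Steps 2 and 3: simulate and compare
            accepted.append(value)
    return accepted                           # Step 4 is the loop itself

def uniform_prior(rng):
    """Hypothetical prior: success probability uniform on [0, 1]."""
    return rng.uniform(0.0, 1.0)

def count_successes(p, rng, trials=10):
    """Hypothetical toy model: number of successes in `trials` Bernoulli draws."""
    return sum(rng.random() < p for _ in range(trials))

rng = random.Random(42)
posterior = rejection_sample(7, uniform_prior, count_successes, 500, rng)
print(sum(posterior) / len(posterior))
```

For this toy model the exact posterior is a Beta(8, 4) distribution with mean 2/3, so the sample mean of the accepted values gives a quick sanity check of the sampler.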
3.2.1 Approximate Bayesian Computation (ABC)

If the data is too complicated, the acceptance rate of the above algorithm becomes very low. Hence we approximate the data comparison by taking summary statistics of the data. Fu and Li [28] suggested further improvements to the above algorithm, by calculating summary statistics of both datasets and comparing them instead of the entire dataset. Another level of approximation is that instead of requiring an exact match of the simulated and observed summary statistic, we accept the parameter if the two are sufficiently close [27]. The iterative algorithm then becomes:

• Step 1: Calculate the summary statistic S for the given data.
• Step 2: Sample a value of the parameter θ from the prior.
• Step 3: Simulate data using your model with the sampled parameter value of θ and calculate the summary statistic S′.
• Step 4: Accept the sampled value of θ if |S − S′| < ε, where ε is the margin of error or tolerance, else discard it.
• Step 5: Return to step 2.

Since the data is approximated at many levels, the success of an ABC method critically depends on the choices made for the following:

• Summary Statistic: The suitability (sufficiency) of the summary statistic needs to be evaluated for the problem at hand. It should exploit the difference in data that is due to the parameter you are trying to measure.
• Data Simulator: The speed of the method will depend on the efficiency of the simulator used to generate data. Coalescent based models are well suited for this.
• Tolerance (ε): Tolerance needs to be chosen as a tradeoff between accuracy, computation time and complexity.
• Prior distribution: Since this method does not explore the parameter space beyond the prior, the true value of the parameters needs to be in the range of the prior.

3.2.2 Sufficiency of statistics

In population genetics there is generally no single sufficient statistic for a given problem. Certain statistics may be adequate for a given parameter.
For example, Tavaré et al [5] found that their inference of coalescence times using the statistic S (number of segregating sites) was quite close to that obtained using other methods. One way to overcome this problem is to use multiple summary statistics. Przeworski [15] used multiple summary statistics to estimate the time since fixation of a beneficial allele. However, doing so can reduce the acceptance rate unless the tolerance (ε) is increased.

3.2.3 Improving the ABC estimator

Beaumont et al. [16] proposed modifications to the rejection methods to make the approximation insensitive to ε, allowing the use of multiple summary statistics. This is done by using local linear regression and smooth weighting of summary statistics. Each accepted value is given a weight that declines quadratically as a function of |S − S′| and then weighted linear regression is used to adjust the accepted values [26]. To overcome some problems with prior distributions being very different from the posterior, Marjoram et al. [29] have proposed an MCMC method without the use of likelihoods, which combines the simplicity of rejection methods with the ability of MCMC to sample from areas of high posterior probability [27].

3.2.4 Comparison to other methods

In contrast to MCMC based analysis, Approximate Bayesian methods are easy to implement. The stopping criteria can be predetermined (for example, based on the number of accepted values required), eliminating the problem of assessing convergence as in MCMC. Since ABC generates independent draws from the prior, the simulations can be run in parallel. However, these methods perform poorly when the prior and posterior distributions are very different, as many draws from the prior are rejected. MCMC methods give better results in this case. Some recent studies have compared the method favorably to MCMC and other alternative methods.
Tallmon et al [26] compared their estimator of effective population size (SummStat) to three existing moment based and likelihood estimators. They found that the Approximate Bayesian estimator was the least biased over the full range of parameters investigated. Beaumont et al [16] found that though approximate methods are faster to implement and take less time for analysis, the MCMC based methods are always superior in accuracy. Nevertheless it is an interesting approach for initial analysis of a new Bayesian method.

CHAPTER 4

RESULTS

4.1 Implementation

The Ancestral Selection Graph (ASG) based simulator for trees with selection conditional on the sample configuration and the estimator based on Approximate Bayesian Computation (ABC) were both written in the programming language C++. The ASG simulates a neutral sequence that is assumed tightly linked (no recombination) to a two allele selected locus. The alleles are A1 and A2, with A2 having a selective advantage. The program allows simultaneous estimation of two parameters, θ = Nµ and σ = Ns (µ is the mutation rate, s is the selection coefficient and N the population size). The priors for both parameters are uniform between minimum and maximum values input by the user. The program prints all the prior values sampled and the respective summary statistics to a file. Analysis and estimation of parameters is done using functions written in MATLAB. This allows for flexibility in estimating parameters with different error tolerance levels and combinations of summary statistics without having to run the simulation process each time. The summary statistics selected are S, the number of segregating sites, and π, the average number of pairwise differences. These statistics were chosen because they are easy to code and commonly used in inference of selection by other methods. On average, the program takes between 2 and 10 hours to sample from 100,000 priors and calculate summary statistics, depending on the parameters θ and σ.
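The two summary statistics named above can be computed directly from a set of aligned sequences. The sketch below is illustrative only: it is written in Python rather than the C++ of the actual program, it assumes equal-length sequences without gaps or missing data, and the function names and the toy sample are hypothetical.

```python
from itertools import combinations

def segregating_sites(seqs):
    """S: number of alignment columns in which more than one base is observed."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def mean_pairwise_differences(seqs):
    """pi: average number of differing sites over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)

# Hypothetical toy sample of four aligned sequences of length five.
sample = ["ACGTA", "ACGTT", "ACCTA", "ACGTA"]
print(segregating_sites(sample))          # 2
print(mean_pairwise_differences(sample))  # 1.0
```

In the toy sample only the third and fifth columns vary, so S = 2, and the six pairs differ at 6 sites in total, so π = 1.0.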
The same program can also be used to simulate DNA datasets for combinations of σ, θ and sample allele configurations.

4.2 Results

4.2.1 Ancestral Selection Graph

Change in frequencies with selection coefficient. If the Ancestral Selection Graph is simulated as proposed by Krone and Neuhauser (1997), there is no control over what allele frequencies arise at the tips. These depend on the choice of the Ultimate Ancestor (UA) and the strength of selection. Figure 4.1 shows simulation results (θ = 1, sample size = 10) of how the average number of selected alleles (A2) at the tips changes as the strength of selection increases. At σ = 0 the type of alleles at the tips depends highly on the allele selected at the UA (which incidentally is the same as the MRCA for σ = 0). As the strength of selection continues to increase, the allele A2 has an advantage and the frequency of A2 alleles at the tips increases. Eventually it does not matter which allele was selected at the UA.

Figure 4.1: Graph showing the average number of selected alleles (A2) that arise at the tips of the ASG, in a simulated sample of size 10 (x-axis: sigma; y-axis: A2 frequency). The two curves show the dependence of the alleles at the tips on the allele selected at the UA. The red curve shows the average number of A2 alleles when the UA allele is A2 and the blue curve is for the UA allele A1.

Figure 4.2: Change in tree lengths with strength of selection (x-axis: sigma; y-axis: tree length).

4.2.2 Properties of trees conditional on sample size

Change in tree lengths with selection coefficient. Figure 4.2 shows simulation results of the change in mean conditional tree length with the strength of selection. This data is simulated using the conditional ASG for a sample of all selected A2 alleles at the tips, allele A1 at the MRCA and θ = 1. As selection gets stronger it requires less time for an A1 ancestor to reach a sample configuration of all A2 alleles at the tips.
Mutation-Selection balance

The change in tree lengths depends upon the mutation rate, the selection coefficient and the sample configuration. Figure 4.3 shows simulation results of the change in mean conditional tree length with the strength of selection and sample allele frequencies. Graph (a) is for θ = 1 and it can be seen that the mean lengths decrease in all cases, but more slowly when the sample allele frequency is 0.5. Since only like alleles can coalesce, some time is spent waiting for a mutation to occur before the final coalescence. This depends on the mutation rate. Graph (b) shows the opposite effect. Here, as the mutation rate is lower (θ = 0.01), the mean tree lengths actually increase with selection. This is a form of mutation-selection balance. The mean tree length for a sample of all A2 alleles at the tips decreases more rapidly than that for a sample of all A1 alleles. Biologically, this makes sense because the favored A2 alleles will rise in frequency rapidly and are expected to find their ancestor sooner than A1 alleles.

Figure 4.3: Change in mean tree length with σ and sample allele frequencies, for (a) θ = 1 and (b) θ = 0.01 (x-axis: sigma; y-axis: tree length). Colors show the allele frequencies of the favored allele (A2) at the tips: blue = 0, green = 0.5 and red = 1.

Variance in tree lengths with the coalescent and selection

Table 4.1 shows the standard deviations of the expected time to the MRCA for θ = 0.01 and for σ from 0 to 10. We can see that as σ increases, the variance in the expected tree lengths keeps increasing. Also note from figure 4.3 that the mean lengths decrease or increase with σ, depending on the allele frequencies of the sample. Hence the percentage error increases substantially with σ. This variance will reduce the power of estimating the parameters.
4.3 Testing the Estimator with Simulated Data

Unless otherwise mentioned, all data were simulated for a sample size of 10 and a sequence length of 1000.

Table 4.1: Change in Standard Deviations in tree lengths with Selection Coefficient

       A2 allele frequencies
σ      0        0.5      1
0      0.01     0.014    0.010
1      0.031    0.234    0.027
2      0.042    0.316    0.034
3      0.048    0.366    0.035
4      0.049    0.395    0.041
5      0.052    0.415    0.037
6      0.053    0.431    0.040
7      0.055    0.445    0.039
8      0.055    0.452    0.037
9      0.054    0.462    0.040
10     0.054    0.45     0.042

Choice of Summary Statistics

Different summary statistics capture information about different aspects of the data. The number of segregating sites S is known to be a good summary for θ under neutrality, whereas the average number of pairwise differences can represent the data more accurately if the tree has two deep branches at the end, as shown by tree (c) in figure 1.3. Such trees are expected when θ is low and we have a non-uniform sample at the tips. To test the performance of summary statistics, data was simulated for θ = 0.01 and σ = 10. The priors were uniform for θ = 0.0001 to 0.1 and σ = 0 to 50. The sample frequency was chosen to be 0.5 for each allele. The data were first analyzed using only one summary statistic, S, and then adding the other statistic, π. Figure 4.4 shows the histogram for the posterior of θ in each case. The posteriors look very different and a combination of the two statistics gives a result closer to the true value in this case.

Comparing run lengths

The number of samples from the prior needed to get a reasonable number of accepted values depends on the parameter space and the prior intervals. Figure 4.5 shows how the mean estimate of σ is sensitive to the number of priors sampled. Data were simulated for θ = 1 and σ = 10. The criterion was that at least 1000 priors should be accepted. The epsilon value was changed accordingly. Two summary statistics were used, S and π. In this case the estimate does not change much after about 50,000 sampled values from the prior.
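The acceptance criterion used in the run length comparison, a fixed number of accepted priors with ε adjusted to match, can be applied after the simulations have finished. The following is a hedged sketch of that post-processing step, not the MATLAB functions used in this work: the Euclidean distance over unscaled statistics, the record layout and the function name are all assumptions, and values at a distance equal to ε are kept.

```python
import math

def accept_nearest(records, observed, n_accept):
    """Keep the n_accept simulated records closest to the observed statistics.

    records: list of (params, stats) tuples read back from the simulation output.
    Returns the accepted parameter values and the implied tolerance epsilon.
    """
    def distance(stats):
        # Assumption: unscaled Euclidean distance between statistic vectors.
        return math.sqrt(sum((s - o) ** 2 for s, o in zip(stats, observed)))

    ranked = sorted(records, key=lambda record: distance(record[1]))
    kept = ranked[:n_accept]
    epsilon = distance(kept[-1][1])  # distance of the farthest accepted record
    return [params for params, _ in kept], epsilon

# Hypothetical records: ((theta, sigma), (S, pi)) for four simulated datasets.
records = [
    ((0.01, 5.0), (3, 1.2)),
    ((0.05, 20.0), (10, 4.0)),
    ((0.02, 9.0), (4, 1.5)),
    ((0.09, 40.0), (25, 9.0)),
]
accepted, epsilon = accept_nearest(records, observed=(4, 1.4), n_accept=2)
print(accepted, epsilon)
```

Fixing the number of accepted values rather than ε keeps the size of the posterior sample constant across runs; the price is that the effective tolerance then depends on how many priors were sampled.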
Figure 4.4: Histograms showing the posteriors of θ using different summary statistics: (a) summary statistic S, (b) summary statistics S and π. The dotted vertical lines show the true value of θ.

Estimating θ and σ together

The genealogy changes in response to both the mutation rate and the selection coefficient. If the priors are too wide then it might be difficult to estimate both parameters together. To test this possibility, data were simulated for θ = 0.01, σ = 10 and all selected alleles (A2) in the sample. Figure 4.6 (a) shows the histogram of the posterior for σ when the priors were θ = 0.0001 to 0.1 and σ = 0 to 50. The posterior has a high variance and a mode not close to the true value. The same data were analyzed again, this time estimating only σ, assuming that θ was known. This was done by setting the prior as θ = 0.01 and σ = 0 to 50. Figure 4.6 (b) shows the histogram of the posterior for σ. It can be seen that the posterior estimate of σ changes drastically when the prior for θ is restricted, and is now closer to the true value σ = 10 of the simulated dataset.

Figure 4.5: Comparing run lengths (x-axis: number of sampled priors, ×10^4; y-axis: sigma). The dotted line shows the true value of σ.

Figure 4.6: Histograms showing the change in the posterior of σ with different priors for θ: (a) priors θ = 0.0001 to 0.1, σ = 0 to 50; (b) priors θ = 0.01, σ = 0 to 50. The dotted vertical lines show the true value of σ.

CHAPTER 5

CONCLUSION

It is easy to simulate trees with directional selection using the ASG conditional on the sample allele configuration. The trees show an increase or decrease in length to the MRCA depending on the mutation rate, selection coefficient and allele frequencies at the tips. However, the variance associated with the total tree length increases substantially as a function of the selection coefficient.
Slade [30] reached the same conclusions about the increase in variance of the time to the MRCA with an increase in either mutation rate or selection coefficient. Similar levels also applied to diploid selection. Selection also expands the state space of the graph [30]. This high variation will affect the precision with which the values of selection coefficients can be estimated.

Mutation rate and selection coefficient together change the shape of the genealogies. A low level of variation in a data set can be attributed to a high mutation rate with a high selection coefficient or to a lower mutation rate with no selection. Thus with no background information on the value of θ (i.e., with a wide prior), it might be difficult to estimate both θ and σ concurrently. Having specific prior information improves the estimates of σ substantially.

Summary statistics capture important information about the variation in data and the underlying genealogy. Multiple statistics can capture more information about the data in certain cases. The two statistics S, the number of segregating sites, and π, the average number of pairwise differences, capture different information about the underlying genealogy. When the shape of the genealogy is very different from that of neutral trees, as seen in the case of a mixed sample of two alleles with low mutation rate, a combination of the two statistics works better than S alone. This is not seen when the sample has only one allele type.

If we have an efficient simulator for our model of choice and summary statistics that capture enough information about the data, then efficient applications that are easy to implement can be built using Approximate Bayesian methods.

5.1 Future Work

5.1.1 Performance testing of the estimator

The efficiency and performance of this estimator need to be tested against multiple datasets with different values of θ, σ, sample allele frequencies, sequence lengths and various combinations of priors.
The performance of the Approximate Bayesian approach needs to be compared to an MCMC based method to see if the latter is more robust to the choice of priors.

5.1.2 Improving performance of ABC

The performance of the estimator can be further improved, and the analysis made insensitive to the tolerance levels used when comparing the summary statistics, by using weighting and linear regression as proposed by Beaumont et al. [16].

5.1.3 Extending the ASG

The effect of directional selection on linked neutral loci has maximum impact only when recombination is very low. If recombination is high or a long time has elapsed since fixation, it might not be possible to see any signs of the selective event in the regions that surround the site under selection. Introducing recombination into the model will make it more realistic, especially if we want to get more information from longer sequences. Extensions of this simple haploid, two allele Ancestral Selection Graph have been proposed by Neuhauser and Krone [21] and Slade [4]. Extending the model to K-allele and diploid models will make it more widely applicable.

5.1.4 Testing with real data

Testing with real data will indicate if this estimator and simulator work for realistic values of mutation rates, population sizes and selection coefficients.

REFERENCES

[1] R. R. Hudson and N. L. Kaplan. The coalescent process and background selection. Philosophical Transactions: Biological Sciences, 349(1327):19–23, 1995.
[2] J. F. C. Kingman. The coalescent. Stochastic Processes and Their Applications, 13:235–248, 1982.
[3] Stephen M. Krone and Claudia Neuhauser. Ancestral processes with selection. Theoretical Population Biology, 51(3):210–237, June 1997.
[4] Paul F. Slade. Simulation of selected genealogies. Theoretical Population Biology, 57(1):35–49, 2000.
[5] S. Tavaré, D. J. Balding, R. C. Griffiths, and P. Donnelly. Inferring coalescence times from DNA sequence data. Genetics, 145(2):505–518, February 1997.
[6] J. F. C. Kingman. On the genealogy of large populations. In J. Gani and E. J. Hannan, editors, Essays in Statistical Science, pages 27–43. Applied Probability Trust, London, 1982.
[7] Magnus Nordborg. Coalescent Theory. Handbook of Statistical Genetics, 2000.
[8] Yuseob Kim and Wolfgang Stephan. Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics, 160(2):765–777, 2002.
[9] B. Charlesworth, M. T. Morgan, and D. Charlesworth. The effect of deleterious mutations on neutral molecular variation. Genetics, 134(4):1289–1303, 1993.
[10] G. A. Watterson. On the number of segregating sites in genetical models without recombination. Theoretical Population Biology, 7(2):256–276, 1975.
[11] Fumio Tajima. Evolutionary relationship of DNA sequences in finite populations. Genetics, 105(2):437–460, 1983.
[12] F. Tajima. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics, 123(3):585–595, November 1989.
[13] John H. McDonald and Martin Kreitman. Adaptive protein evolution at the Adh locus in Drosophila. Nature, 351(6328):652–654, June 1991.
[14] Brian C. Verrelli, John H. McDonald, George Argyropoulos, Giovanni Destro-Bisol, Alain Froment, Anthi Drousiotou, Gerard Lefranc, Ahmed N. Helal, Jacques Loiselet, and Sarah A. Tishkoff. Evidence for balancing selection from nucleotide sequence analyses of human G6PD. Am J Hum Genet, 71(5):1112–28, 2002.
[15] Molly Przeworski. Estimating the time since the fixation of a beneficial allele. Genetics, 164(4):1667–1676, 2003.
[16] Mark A. Beaumont, Wenyang Zhang, and David J. Balding. Approximate Bayesian computation in population genetics. Genetics, 162(4):2025–2035, December 2002.
[17] N. L. Kaplan, T. Darden, and R. R. Hudson. The coalescent process in models with selection. Genetics, 120(3):819–829, 1988.
[18] N. L. Kaplan, R. R. Hudson, and C. H. Langley. The "hitchhiking effect" revisited. Genetics, 123(4):887–899, 1989.
[19] M. Slatkin. Simulating genealogies of selected alleles in a population of variable size. Genet Res, 78(1):49–57, 2001.
[20] R. R. Hudson and N. L. Kaplan. The coalescent process in models with selection and recombination. Genetics, 120:831–840, 1988.
[21] C. Neuhauser and S. M. Krone. The genealogy of samples in models with selection. Genetics, 145(2):519–534, 1997.
[22] John Wakeley. The limits of theoretical population genetics. Genetics, 169(1):1–7, January 2005.
[23] Paul F. Slade. Most recent common ancestor probability distributions in gene genealogies under selection. Theoretical Population Biology, 58(4):291–305, 2000.
[24] Nicoleen Cloete, Geoff K. Nicholls, and David J. Scott. Simulation of ancestral selection graphs for Monte Carlo integration. In Jagoda Crawford and A. J. Roberts, editors, Proc. of 11th Computational Techniques and Applications Conference CTAC-2003, volume 45, pages C391–C404, June 2004.
[25] J. S. Shoemaker, I. S. Painter, and B. S. Weir. Bayesian statistics in genetics: a guide for the uninitiated. Trends in Genetics, 15:354–358, 1999.
[26] David A. Tallmon, Gordon Luikart, and Mark A. Beaumont. Comparative evaluation of a new effective population size estimator based on approximate Bayesian computation. Genetics, 167(2):977–988, June 2004.
[27] Paul Marjoram and Simon Tavaré. Modern computational approaches for analysing molecular genetic variation data. Nat Rev Genet, 7(10):759–770, 2006.
[28] Y. X. Fu and W. H. Li. Estimating the age of the common ancestor of a sample of DNA sequences. Mol Biol Evol, 14(2):195–199, February 1997.
[29] Paul Marjoram, John Molitor, Vincent Plagnol, and Simon Tavaré. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences of the United States of America, 100(26):15324–15328, 2003.
[30] Paul F. Slade. Simulation of 'hitch-hiking' genealogies. Journal of Mathematical Biology, 42(1):41–70, 2001.

BIOGRAPHICAL SKETCH

Sonali Joshi was born in India on Dec 26th 1976. She graduated from high school in 1994 and went on to graduate with a B.Eng. in Electronics and Telecommunications Engineering from Pune University, India in 1998. After graduation she worked for four years in Mumbai, India writing software for manufacturing and financial companies. She returned to graduate school in 2004 at the Florida State University to study Computational Biology and Population Genetics. She graduates with an M.S. in Biological Science in Summer 2007.
