Try it: Genetic Drift in Excel

In this class you will gain plenty of experience running simulation programs written by others. While you may not want to bother with the nuances of programming languages, you will come to find that Microsoft Excel provides much of the power of a full programming language with a much shallower learning curve. Learning to use it well, or really with any degree of proficiency, will make your academic life much easier.

In the lab we have performed Monte Carlo simulations of genetic drift using the program Populus. However, this is actually a very simple task to perform and can be done in Excel quite easily. Furthermore, the flexibility of Excel allows you to easily generate many summary statistics that do not come standard with the simulation programs we use in the lab. Essentially, you can have your cake, eat it, and then bake a much more awesome cake. Below, I show you how to perform a Monte Carlo simulation of genetic drift (with and without selection) in Excel from scratch. Try it, and then play around with it. See what happens!

Genetic Drift Without Selection

Genetic drift, at its most fundamental, is simply a sampling problem. For each finite population, I sample alleles each generation, and the frequency of each allele will change from generation to generation because I won’t always pick the same number of each allele. Thus, over time the allele frequency will drift, either to fixation or to loss.

Because I am not worried about selection yet, I only need two parameters to start my simulation: the size of each population (N) and the starting frequency of an allele at a 2-allele locus, which I call p. I enter these in the top-left corner of a spreadsheet; in this case I use N = 10 and p = 0.1, but I could pick any numbers I wanted.
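To make the "sampling problem" concrete outside of a spreadsheet, here is a minimal Python sketch of a single generation of drift. It is only an illustration under stated assumptions: the function name `next_generation_frequency`, the fixed seed, and the choice to draw each of the 2N alleles one at a time are all mine and are not part of the Excel workbook described here.

```python
import random


def next_generation_frequency(p, n_individuals, rng):
    """Draw 2N alleles at random; each one is the allele of interest with
    probability p. Return the allele's frequency in the new generation."""
    n_alleles = 2 * n_individuals
    copies = sum(1 for _ in range(n_alleles) if rng.random() < p)
    return copies / n_alleles


# Assumed example values: N = 10 and p = 0.1, as in the text.
rng = random.Random(1)  # fixed seed only so the run is repeatable
N, p0 = 10, 0.1

# Five independent draws from the same starting frequency: the results
# differ from one another purely because of sampling.
draws = [next_generation_frequency(p0, N, rng) for _ in range(5)]
print(draws)
```

Every value printed is a multiple of 1/20, because only whole alleles can be drawn out of 2N = 20. The scatter around 0.1 is the drift.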
I want to simulate how allele frequencies will change over time simply due to the effects of genetic drift, and I want to do this for a bunch of populations so that I can calculate some summary statistics, such as the probability of fixation and the average time to fixation of my allele of interest. To begin with, I set up a population in each column, and I assume that they each have the starting value of p at generation 0. I set up about 150 populations on my spreadsheet, but you can add as many as you like. In fact, you should see how the error of your summary statistics changes as you add more populations. Here is a sample of my first two rows.

[Screenshot: first two rows of the spreadsheet]

I should mention that I did not just enter the starting value of p into each of those cells. I actually like to do as little work as possible, so I entered the following into cell E2 and then dragged it across to each of my populations:

[Formula image: cell E2]

This references my starting value of p that I entered in the top-left corner. This will allow me to change this starting value and have each population automatically take this into account. Note my use of the “$” signs in front of the letter and the number. This allows me to drag the formula without the letter and number changing.

Now for the fun part! Remember that alleles are drawn from a binomial distribution. Therefore, in order to simulate the change in alleles due to chance, I need a way to simulate draws from a binomial distribution consisting of 2N trials (in my example 20) and with a probability of success equal to p in the previous generation. This is where the “Monte Carlo” part comes in (Monte Carlo being a famous casino). I enter the following formula into cell E3, drag it across to all my populations, and then drag down for a large number of generations (enough that everything reaches fixation or loss):

[Formula image: cell E3]

The breakdown:
1. $B$1 refers to the population size N. It is multiplied by 2 because a diploid population of N individuals carries 2N alleles, which is the number of trials.
2. E2 refers to the value of p for this population in the previous generation (no “$” signs, because I always want this to refer to p in the previous generation of a given population no matter where I drag it).
3. CRITBINOM(2N,p,RAND()) simulates a random draw from a binomial probability distribution with 2N trials and a probability of success p. CRITBINOM is a function that returns the smallest number of successes for which the cumulative binomial probability is greater than or equal to a given threshold (if you’re really interested in what that means, please come see me). In this case that threshold is a random number (from a uniform distribution) generated by the RAND function.
4. Because CRITBINOM(2N,p,RAND()) simulates the number of alleles drawn from the previous generation, I divide by the total number of alleles (2N) in order to get the value of p in the following generation.
5. The IFERROR function isn’t entirely necessary, but if it is not included then the formula generates an error after the population reaches fixation or loss. This function simply checks whether the CRITBINOM function has generated an error (due to fixation or loss) and, if so, outputs the value of p in the previous generation, which will be either 0 or 1.

After doing this, my first few rows look something like the following. Note that your numbers will be somewhat different, as all values depend on random numbers.

[Screenshot: first few generations of the simulation]

Notice that some populations lose the allele after only a few generations. Others are headed toward fixation and do eventually get there.

Now that I have a bunch of simulated populations, I would really like to get some summary statistics. I’ll show you how to obtain an estimate for the probability of fixation as well as the average time to fixation. Obtaining an estimate for the probability of fixation is actually very easy.
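As an aside on the sampling formula, the breakdown suggests a cell formula shaped roughly like =IFERROR(CRITBINOM(2*$B$1,E2,RAND())/(2*$B$1),E2). That exact form is a reconstruction from the numbered breakdown, not a copy of the original formula. The Python sketch below mirrors the same logic under that assumption. The helper names `crit_binom` and `drift_trajectory` are hypothetical, and the explicit check for 0 or 1 stands in for IFERROR.

```python
import math
import random


def crit_binom(trials, prob, alpha):
    """Smallest number of successes k whose cumulative binomial probability
    is >= alpha (the behaviour described for CRITBINOM in the text)."""
    cumulative = 0.0
    for k in range(trials + 1):
        cumulative += math.comb(trials, k) * prob**k * (1 - prob) ** (trials - k)
        if cumulative >= alpha:
            return k
    return trials  # guard against a tiny floating-point shortfall


def drift_trajectory(n_individuals, p0, generations, rng):
    """One population (one spreadsheet column): p at generation 0, 1, 2, ..."""
    n_alleles = 2 * n_individuals
    freqs = [p0]
    for _ in range(generations):
        p = freqs[-1]
        if p in (0.0, 1.0):
            # Stand-in for IFERROR: after fixation or loss, carry p forward.
            freqs.append(p)
        else:
            # A uniform random threshold turns the cumulative distribution
            # into a random binomial draw (the RAND() trick).
            freqs.append(crit_binom(n_alleles, p, rng.random()) / n_alleles)
    return freqs


rng = random.Random(2024)  # fixed seed is an assumption, for repeatability
column = drift_trajectory(n_individuals=10, p0=0.1, generations=100, rng=rng)
print(column[:10])
```

Feeding a uniform random number into the cumulative distribution is the standard inverse-transform way of sampling, which is why a function that looks like a table lookup can act as a random number generator here.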
I simply need a count of all the populations where the frequency of my allele reached 1 (fixation), and then I will divide that by the total number of populations. Here is the formula I entered into cell B4 (you could use another cell):

[Formula image: cell B4]

The breakdown:
1. COUNTIF counts the number of cells in a range where a certain condition is met. In this case I’m interested in the last row of the simulation (shown below):

[Screenshot: last row of the simulation]

Note that all the populations have gone either to fixation or to loss. COUNTIF is then simply counting all the populations where the cell value equals 1 (fixation).
2. COUNT counts all the cells in the range that contain numbers. I divide the two in order to get the proportion where fixation has occurred, which will serve as my estimate for the probability of fixation.

Obtaining the average time until fixation is almost as simple. I only need to get the time until fixation (if it occurs) for each population and then average them together. I obtain the time until fixation for each population using the following formula in the row after the last row of each drift simulation:

[Formula image: time until fixation]

The breakdown:
1. Everything is set within an IF function, which has the following format: IF(condition, value if condition true, value if condition false). The condition in this case is whether the cell above it (the last generation in the simulation) has reached a frequency of 1 (fixation).
2. If fixation has been achieved, then the COUNTIF function counts all the rows (generations) where the frequency is less than 1 (“<1”). Because generation 0 is included in that count, the result equals the number of generations that the population took to reach fixation.
3. If the population did not reach fixation, then the formula will output a blank cell, which is specified by empty quotation marks (“”).

After inputting this formula, the line after the last row of the simulation appears similar to the following:

[Screenshot: time until fixation for each population]

Notice that only the populations that reached fixation have a value for time until fixation.
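The two counting steps above can be sketched in Python as follows. This is only an illustration: `simulate_population` and `time_to_fixation` are hypothetical names, `None` stands in for the blank cell produced by the empty quotation marks, and 200 generations for 150 populations is an assumed spreadsheet size rather than one stated in the text.

```python
import random


def simulate_population(n_individuals, p0, generations, rng):
    """One spreadsheet column: allele frequency at generation 0, 1, 2, ..."""
    n_alleles = 2 * n_individuals
    freqs = [p0]
    for _ in range(generations):
        p = freqs[-1]
        copies = sum(1 for _ in range(n_alleles) if rng.random() < p)
        freqs.append(copies / n_alleles)
    return freqs


def time_to_fixation(freqs):
    """Mirror of IF(last cell = 1, COUNTIF(column, "<1"), ""):
    count the generations with p < 1, or return None (the blank cell)."""
    if freqs[-1] == 1:
        return sum(1 for f in freqs if f < 1)
    return None


rng = random.Random(42)  # fixed seed is an assumption, for repeatability
populations = [simulate_population(10, 0.1, 200, rng) for _ in range(150)]

# Probability of fixation: COUNTIF(last row, 1) / COUNT(last row).
last_row = [freqs[-1] for freqs in populations]
prob_fixation = sum(1 for f in last_row if f == 1) / len(last_row)

# Time until fixation for each population (None where the allele was lost).
times = [time_to_fixation(freqs) for freqs in populations]
print(prob_fixation, [t for t in times if t is not None][:5])
```

Keeping `None` for the populations that lost the allele makes it easy to skip them later, just as a spreadsheet average skips blank cells.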
Now I only need to average these values together in order to get the average time until fixation. Fortunately, the AVERAGE function in Excel ignores blank values, so it will only take into account the populations for which the allele did reach fixation. I enter the following formula into cell B5:

[Formula image: cell B5]

After doing this, the left side of my spreadsheet appears as follows, where I have obtained estimates for both the probability of fixation and the average fixation time with a population of 10 diploid individuals and a starting p of 0.1:

[Screenshot: summary statistics for N = 10, p = 0.1]

Now I can be really lazy. Because my formulas all ultimately reference the starting N and p values, I only need to change these around and then see what happens to my summary statistics. First, I try reentering 0.1 for p a few times to get an idea of the error of each estimate:

[Screenshot: repeated estimates for p = 0.1]

The more populations I add, the less error I will have in my estimates, but these should give you a good idea of where the actual parameter should lie. You can also try varying the starting values of N and p, as I show below:

[Screenshot: estimates for different values of N and p]

Note that the simulations can take a fair amount of processing power, especially as the population size gets larger. There are more efficient ways of performing these simulations, but these do require some programming knowledge. Feel free to come to my office if you’re interested in learning some alternative methods.

Genetic Drift With Selection

Now that we’ve created a model for drift without any other intervening forces, we would like to see what happens if some other force, such as selection, is added into the mix. This is actually very easy to do because I only need to modify one formula from the previous spreadsheet.
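Looking back at the drift-only model for a moment, the "reenter p a few times" experiment for judging the error of an estimate can also be sketched in Python. Treat it as a rough illustration under assumptions: the names `fixation_outcome` and `estimate`, the seed, the ten repeats, and the comparison of 50 against 400 populations are all choices made here, not values from the spreadsheet.

```python
import random
import statistics


def fixation_outcome(n_individuals, p0, rng, max_generations=10_000):
    """Run one population until fixation or loss.
    Returns (fixed, generations elapsed)."""
    n_alleles = 2 * n_individuals
    p, t = p0, 0
    while 0 < p < 1 and t < max_generations:
        copies = sum(1 for _ in range(n_alleles) if rng.random() < p)
        p = copies / n_alleles
        t += 1
    return p == 1, t


def estimate(n_populations, n_individuals, p0, rng):
    """Return (estimated probability of fixation, mean time to fixation).
    Like AVERAGE skipping blanks, the mean uses only populations that fixed."""
    outcomes = [fixation_outcome(n_individuals, p0, rng)
                for _ in range(n_populations)]
    fixed_times = [t for fixed, t in outcomes if fixed]
    prob = len(fixed_times) / n_populations
    mean_time = statistics.mean(fixed_times) if fixed_times else None
    return prob, mean_time


rng = random.Random(7)  # fixed seed is an assumption, for repeatability
for n_populations in (50, 400):
    # Repeat the whole experiment ten times, like reentering p in the sheet.
    probs = [estimate(n_populations, 10, 0.1, rng)[0] for _ in range(10)]
    print(n_populations, "populations: spread of estimates =",
          round(statistics.stdev(probs), 3))
```

The spread of the repeated estimates is the "error" referred to above. With more populations per run it would be expected to shrink, although any single comparison can go the other way by chance.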
Let us assume the situation of positive (Darwinian) selection, which is often modeled as follows:

Genotype   Relative fitness   Relative fitness in terms of h and s
A1A1       ω11                1 + s
A1A2       ω12                1 + hs
A2A2       ω22                1

In this case we can find the frequency of the A1 allele (p) after selection using the following formula, found on page 46 of your lab manual, where q = 1 − p is the frequency of the A2 allele:

$$p' = \frac{p^{2}(1 + s) + pq(1 + hs)}{1 + 2pqhs + p^{2}s}$$

Now I set up a new spreadsheet much like the one before, except now I include terms for h and s in the top-left corner.

[Screenshot: parameters N, p, s, and h in the top-left corner]

And now I can set up my spreadsheet the same way I did before, except for a modification to my binomial sampling formula. Here is what I enter into cell E3:

[Formula image: cell E3 with selection]

Don’t panic. These formulas always look more complicated than they actually are. I essentially have the same formula as I had before. I have just replaced p from the previous generation with the formula for p’ shown above.

The breakdown:
1. $B$1 once again references my N, such that it will not move when I drag across or down.
2. $B$3 references the value of s, which should also not change.
3. $B$4 references the value of h, which should not change.
4. E2 is once again the value of p from the previous generation. The only difference is that I have inserted it into the formula to obtain the value of p after selection.

Essentially, the formula assumes that the previous generation started with the value of p shown above it, and then selection occurred, such that mating randomly samples the pool of alleles after selection.

After setting up the rest of the spreadsheet as I did previously, I can now test different values of N, p, s, and h for the probability of fixation and the average time to fixation. I show a few examples below.

[Screenshot: example results for different values of N, p, s, and h]

Theoretically, I could incorporate any other forces I wanted using similar techniques and then test their effects empirically. This is the power of simulation: you’re only limited by your imagination (and sometimes by computational power).
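For anyone who wants to check the selection step outside of Excel, here is a minimal Python sketch of "selection first, then binomial sampling". The function names `p_after_selection` and `next_generation`, the seed, and the example values s = 0.1 and h = 0.5 are assumptions made for illustration. The formula in `p_after_selection` is the standard update for the fitness scheme 1 + s, 1 + hs, 1 described above.

```python
import random


def p_after_selection(p, s, h):
    """Frequency of A1 after selection, with fitnesses 1+s, 1+hs, and 1."""
    q = 1 - p
    numerator = p * p * (1 + s) + p * q * (1 + h * s)
    mean_fitness = 1 + 2 * p * q * h * s + p * p * s
    return numerator / mean_fitness


def next_generation(p, n_individuals, s, h, rng):
    """Apply selection to last generation's p, then sample 2N alleles."""
    n_alleles = 2 * n_individuals
    p_selected = p_after_selection(p, s, h)
    copies = sum(1 for _ in range(n_alleles) if rng.random() < p_selected)
    return copies / n_alleles


rng = random.Random(11)  # fixed seed is an assumption, for repeatability
p, trajectory = 0.1, [0.1]
for _ in range(50):
    p = next_generation(p, n_individuals=10, s=0.1, h=0.5, rng=rng)
    trajectory.append(p)
print(trajectory[:10])
```

Setting s = 0 makes `p_after_selection` return p unchanged, so the sketch falls back to pure drift. That is a quick sanity check worth running on the spreadsheet version too.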