Try it: Genetic Drift in Excel
In this class you will gain plenty of experience running simulation programs written by
others. While you may not want to bother with the nuances of programming languages, you
will come to find that Microsoft Excel provides much of the power of a full programming
language with a much shallower learning curve. Learning to use it well, or really with any
degree of proficiency, will make your academic life much easier.
Excel: Pretty much these guys with less marketing.
In the lab we have performed Monte Carlo simulations of genetic drift using the program Populus.
However, this is actually a very simple task to perform and can be done in Excel quite easily.
Furthermore, the flexibility of Excel allows you to easily generate many summary statistics that do not
come standard with the simulation programs we use in the lab. Essentially, you can have your cake, eat
it, and then bake a much more awesome cake. Below, I show you how to perform a Monte Carlo
simulation of genetic drift (with and without selection) in Excel from scratch. Try it, and then play
around with it. See what happens!
Genetic Drift Without Selection
Genetic drift, at its most fundamental, is simply a sampling problem. For each finite population, I
sample alleles each generation, and the frequency of each allele will change from generation to
generation because I won’t always pick the same number of each allele. Thus, over time the allele
frequency will drift, either to fixation or loss. Because I am not worried about selection yet, I only need
two parameters to start my simulation: the size of each population and the starting frequency of an
allele for a 2-allele locus – I use p. I enter these in the top left corner of a spreadsheet; in this case I use
N = 10 and p = 0.1, but I could pick any numbers I wanted.
I want to simulate how allele frequencies will change over time simply due to the effects of genetic
drift, and I want to do this for a bunch of populations so that I can calculate some summary statistics
such as the probability of fixation and the average time to fixation of my allele of interest. To begin with,
I set up a population in each column, and I assume that they each have the starting value of p at
generation 0. I set up about 150 populations on my spreadsheet, but you can add as many as you like. In
fact, you should see how the error of your summary statistics changes as you add more populations.
Here is a sample of my first two rows.
I should mention that I did not just enter the starting value of p into each of those cells. I actually like
to do as little work as possible, so I entered the following into cell E2 and then dragged it across to each
of my populations:
This references my starting value of p that I entered in the top-left corner. This will allow me to
change this starting value and have each population automatically take this into account. Note my use of
the “$” signs in front of the letter and the number. This allows me to drag the formula without the letter
and number changing.
Now for the fun part! Remember that alleles are drawn from a binomial distribution. Therefore, in
order to simulate the change in alleles due to chance, I need a way to simulate draws from a binomial
distribution consisting of 2N trials (in my example 20) and with a probability of success equal to p in the
previous generation. This is where the “Monte Carlo” part comes in (Monte Carlo being a famous
casino). I enter the following formula into cell E3, drag it across to all my populations, and then drag
it down for a large number of generations (enough for every population to reach fixation or loss):
The breakdown:
1. $B$1 refers to the population size N (the number of diploid individuals). This is multiplied by 2
because a diploid population carries 2N alleles, so there are 2N trials.
2. E2 refers to the value of p for this population in the previous generation (no “$” signs, because I
always want this to refer to p in the previous generation of a given population no matter where I
drag it).
3. CRITBINOM(2N,p,RAND()) simulates a random draw from a binomial probability distribution
with 2N trials and a probability of success p. CRITBINOM is a function that finds the smallest
number of successes for which the cumulative probability of this binomial distribution is greater
than or equal to a given threshold (if you’re really interested in what that means, please come
see me). In this case that threshold is a random number (from a uniform distribution between 0
and 1) generated by the RAND function.
4. Because CRITBINOM(2N,p,RAND()) simulates the number of alleles drawn from the previous
generation, I divide by the total number of alleles (2N) in order to get the value of p in the
following generation.
5. The IFERROR function isn’t entirely necessary, but if it is not included then the function
generates an error after the population reaches fixation or loss. This function simply checks if
the CRITBINOM function has generated an error (due to fixation or loss) and, if so, outputs the
value of p in the previous generation, which will either be 0 or 1.
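For readers who want to see the sampling step outside of Excel, the breakdown above can be sketched in Python. This is only an illustration under stated assumptions: the function names (crit_binom, next_frequency, simulate_population) are hypothetical, and crit_binom is written to behave the way CRITBINOM is described above rather than copied from any Excel source.

```python
import random
from math import comb


def crit_binom(trials, prob, alpha):
    """Smallest k whose cumulative binomial probability is >= alpha.

    Hypothetical helper meant to mirror CRITBINOM(trials, prob, alpha).
    """
    cumulative = 0.0
    for k in range(trials + 1):
        cumulative += comb(trials, k) * prob**k * (1 - prob) ** (trials - k)
        if cumulative >= alpha:
            return k
    return trials  # guard against floating-point round-off in the sum


def next_frequency(n_individuals, p_prev, rng=random):
    """One generation of drift: sample 2N alleles, return the new frequency."""
    if p_prev in (0.0, 1.0):
        return p_prev  # already lost or fixed (the case IFERROR handles)
    n_alleles = 2 * n_individuals
    return crit_binom(n_alleles, p_prev, rng.random()) / n_alleles


def simulate_population(n_individuals, p_start, generations, rng=random):
    """Return one population's allele frequencies, generation 0 first."""
    trajectory = [p_start]
    for _ in range(generations):
        trajectory.append(next_frequency(n_individuals, trajectory[-1], rng))
    return trajectory


if __name__ == "__main__":
    random.seed(1)  # arbitrary seed so the run is repeatable
    print(simulate_population(n_individuals=10, p_start=0.1, generations=15))
```

Each call to simulate_population plays the role of one spreadsheet column; calling it about 150 times would correspond to the 150 populations set up earlier.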
After doing this, my first few rows look something like the following. Note that your numbers will be
somewhat different, as all values are dependent upon random numbers.
Notice that some populations achieve loss of the allele after only a few generations. Others are
headed toward fixation and do eventually get there. Now that I have a bunch of simulated populations, I
would really like to get some summary statistics. I’ll show you how to obtain an estimate for the
probability of fixation as well as the average time to fixation.
Obtaining an estimate for the probability of fixation is actually very easy. I simply need a count of all
the populations where the frequency of my allele reached 1 (fixation), and then I will divide that by the
total number of populations. Here is the formula I entered into cell B4 (you could use another cell):
The breakdown
1. COUNTIF counts the number of cells in a range where a certain condition is met. In this case I’m
interested in the last row of the simulation (shown below):
Note that all the populations have either gone to fixation or loss. COUNTIF is then simply
counting all the populations where the cell value equals 1 (fixation).
2. COUNT counts all the cells in the range that contain numbers. I divide the two in order to get the
proportion where fixation has occurred, which will serve as my estimate for the probability of
fixation.
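The same count-and-divide logic can be sketched in Python. The function name and the sample last row are hypothetical, and the sketch assumes every population has already reached 0 or 1 by the final generation.

```python
def fixation_probability(final_frequencies):
    """Fraction of populations whose final allele frequency is 1."""
    fixed = sum(1 for p in final_frequencies if p == 1)  # like COUNTIF(range, 1)
    total = len(final_frequencies)                       # like COUNT(range)
    return fixed / total


# Hypothetical last row of a simulation with eight populations.
last_row = [0, 0, 1, 0, 0, 0, 1, 0]
print(fixation_probability(last_row))  # 0.25
```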
Obtaining the average time until fixation is almost as simple. I only need to get the time until fixation
(if it occurs) for each population and then average them together. I obtain the time until fixation for
each population using the following formula in the row after the last row in each drift simulation:
The breakdown
1. Everything is set within an IF function, which has the following format: IF(condition, value if
condition true, value if condition false). The condition in this case is whether the cell above it –
the last generation in the simulation – has reached a frequency of 1 (fixation).
2. If fixation has been achieved, then the COUNTIF function counts all the rows (generations)
where the frequency is less than 1 (“<1”). This will then equal the number of generations that
the population took to reach fixation.
3. If the population did not reach fixation, then the formula will output a blank cell, which is
specified by empty quotation marks ("").
After inputting this formula, the line after the last row of the simulation appears similar to the
following:
Notice that only the populations that reached fixation have a value for time until fixation. Now I only
need to average these values together in order to get the average time until fixation. Fortunately, the
AVERAGE function in Excel ignores blank values, so it will only take into account the populations for
which the allele did reach fixation. I enter the following formula into cell B5:
After doing this, the left side of my spreadsheet appears as follows, where I have obtained estimates
for both the probability of fixation and the average fixation time with a population of 10 diploid
individuals with a starting p of 0.1:
Now I can be really lazy. Because my formulas all ultimately reference the starting N and p values, I
only need to change these around and then see what happens to my summary statistics. First, I try
reentering 0.1 for p a few times to get an idea of the error of each estimate:
The more populations I add, the less error I will have in my estimates, but these should give you a
good idea of where the actual parameter should lie. You can also try varying the starting values of N and
p, as I show below:
Note that the simulations can take a fair amount of processing power, especially as the
population size gets larger. There are more efficient ways of performing these simulations, but
these do require some programming knowledge. Feel free to come to my office if you’re
interested in learning some alternative methods.
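As one small illustration, the fixation-time bookkeeping described earlier in this section (an IF wrapped around COUNTIF, followed by AVERAGE) can be sketched in Python. The function names and the three sample trajectories are hypothetical; None stands in for the blank cell.

```python
def time_to_fixation(trajectory):
    """Generations until fixation, or None if the allele did not fix.

    trajectory lists one population's allele frequency, generation 0 first.
    """
    if trajectory[-1] != 1:
        return None  # plays the role of the blank cell ("")
    return sum(1 for p in trajectory if p < 1)  # like COUNTIF(column, "<1")


def mean_time_to_fixation(trajectories):
    """Average fixation time over only the populations that fixed."""
    times = [t for t in map(time_to_fixation, trajectories) if t is not None]
    return sum(times) / len(times) if times else None


# Hypothetical trajectories for three small populations.
populations = [
    [0.1, 0.3, 0.6, 1.0, 1.0],   # fixes in generation 3
    [0.1, 0.05, 0.0, 0.0, 0.0],  # allele lost
    [0.1, 0.5, 0.8, 0.9, 1.0],   # fixes in generation 4
]
print(mean_time_to_fixation(populations))  # 3.5
```

Skipping the None entries corresponds to AVERAGE ignoring the blank cells, so populations that lost the allele do not pull the average down.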
Genetic Drift With Selection
Now that we’ve created a model for drift without any other intervening forces, we would like to see
what happens if some other force, such as selection, is added into the mix. This is actually very easy to
do because I only need to modify one formula from the previous spreadsheet. Let us assume the
situation of positive (Darwinian) selection, which is often modeled as follows:
Genotype    Relative fitness    Relative fitness in terms of h and s
A1A1        ω11                 1+s
A1A2        ω12                 1+hs
A2A2        ω22                 1
In this case we can find the frequency of the A1 allele (p) after selection using the following formula,
found on page 46 of your lab manual.
$$p' = \frac{p^2(1+s) + pq(1+hs)}{1 + 2pqhs + p^2 s}$$
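As a quick check on where the denominator comes from, it can be read as the mean fitness of the population. The short derivation below is a sketch that assumes q = 1 - p is the frequency of the A2 allele and that genotypes occur in Hardy-Weinberg proportions before selection (assumptions not spelled out in the text above):

```latex
\begin{aligned}
\bar{\omega} &= p^2\,\omega_{11} + 2pq\,\omega_{12} + q^2\,\omega_{22} \\
             &= p^2(1+s) + 2pq(1+hs) + q^2 \\
             &= (p+q)^2 + p^2 s + 2pqhs \\
             &= 1 + 2pqhs + p^2 s, \\[4pt]
p' &= \frac{p^2\,\omega_{11} + pq\,\omega_{12}}{\bar{\omega}}
    = \frac{p^2(1+s) + pq(1+hs)}{1 + 2pqhs + p^2 s}.
\end{aligned}
```

Setting s = 0 gives p' = p(p + q) = p, so without selection the allele frequency is unchanged before sampling.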
Now I set up a new spreadsheet much like the one before, except now I include terms for h and s in
the top left corner.
And now I can set up my spreadsheet the same way I did before except for a modification to my
binomial sampling formula. Here is what I enter into cell E3:
Don’t panic. These formulas always look more complicated than they actually are. I essentially have
the same formula as I had before. I have just replaced p from the previous generation with the formula
for p’ shown above.
The breakdown
1. $B$1 once again references my N such that it will not move when I drag across or down.
2. $B$3 references the value of s, which should also not change.
3. $B$4 references the value of h, which should not change.
4. E2 is once again the value of p from the previous generation. The only difference is that I have
inserted it into the formula to obtain the value of p after selection. Essentially, the formula
assumes that the previous generation started with the value of p shown above it, and then
selection occurred, such that mating is randomly sampling the pool of alleles after selection.
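The selection-then-sampling order described in the breakdown can be sketched in Python. The function names are hypothetical, and the binomial draw here is a plain sum of 2N random yes/no trials, which is a different way of drawing the sample than the CRITBINOM and RAND combination used in the spreadsheet.

```python
import random


def frequency_after_selection(p, s, h):
    """Frequency of A1 after selection, with fitnesses 1+s, 1+hs and 1."""
    q = 1 - p
    mean_fitness = 1 + 2 * p * q * h * s + p * p * s
    return (p * p * (1 + s) + p * q * (1 + h * s)) / mean_fitness


def next_frequency(n_individuals, p_prev, s, h, rng=random):
    """Apply selection to the previous generation, then sample 2N alleles."""
    p_selected = frequency_after_selection(p_prev, s, h)
    n_alleles = 2 * n_individuals
    # Sum of 2N Bernoulli draws: a binomial sample, drawn differently
    # from the CRITBINOM/RAND approach in the spreadsheet.
    copies = sum(rng.random() < p_selected for _ in range(n_alleles))
    return copies / n_alleles


if __name__ == "__main__":
    random.seed(2)  # arbitrary seed so the run is repeatable
    print(frequency_after_selection(0.5, 0.1, 0.5))
    print(next_frequency(n_individuals=10, p_prev=0.5, s=0.1, h=0.5))
```

With s set to 0, frequency_after_selection returns p unchanged (up to rounding), so the same sketch reduces to the drift-only model.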
After setting up the rest of the spreadsheet as I did previously, I can now test different values of N, p,
s, and h for the probability of fixation and the average time to fixation. I show a few examples below.
Theoretically, I could incorporate any other forces I wanted using similar techniques and
then test their effects empirically. This is the power of simulation: you’re only limited by your
imagination (and sometimes by computational power).