Bio 113/244 Problem Set #1

1) Imagine that DNA contained 7 nucleotides instead of 4. Derive the Jukes-Cantor correction for this situation. Draw the graph of the dependence of the probability of starting with one nucleotide and ending with the same one (Pr[X,X]) after a time (t). Draw a similar graph for Pr[not-X,X].

2) Substitutions between leucine and isoleucine are much more common in proteins than substitutions between leucine and serine. How would you explain this observation? Consider all possibilities and suggest how you would distinguish among them.

3) Noncoding DNA in a newly sequenced microbe is 50% GC rich and has roughly the following proportions of dinucleotides of the form (CpX): CpA: 10%, CpT: 20%, CpG: 50%, CpC: 20%. Provide an explanation for this pattern. (Hint: Compare it to the situation in the mammalian genome.)

4) Derive equation #3.27 from Graur and Li. (Hint: instead of proceeding as in the one-parameter model, first derive P (= probability of two sequences differing by a transition) and Q (= probability of two sequences differing by a transversion), decide what K (= number of substitutions per site since time of divergence) should be, and rearrange to get K in terms of P and Q.)

5) Suppose you have the following data on genotype frequencies for the ABO blood groups:

Genotype   Frequency
OO         0.4
OA         0.3
AA         0.08
OB         0.12
BB         0.04
AB         0.06

Calculate the genotype frequencies for the next generation assuming random mating.

6) In Margaret Atwood’s novel Oryx and Crake, the character Crake designs a new species of human being, bereft of the flaws in Homo sapiens responsible for hunger, disease, and war. Crake substitutes his creation for normal humans by designing a Homo sapiens-specific virus that he releases worldwide. Suppose this new species starts with a population size of 100 that by design doubles every year for ten years and then remains constant in size.
a) Assuming the Wright-Fisher model, what is the variance effective population size (Ne) during the tenth year?
b) Suppose Crake designed the new species to have extremely high-fidelity DNA replication. If the observed total heterozygosity is 2.2 x 10^-9, did Crake succeed based on the estimated mutation rate?

7) The following DNA sequences come from an orthologous locus in two cosmopolitan species of fly, Drosophila simulans and Drosophila melanogaster. Only one sequence has been sampled from the melanogaster population, but six sequences have been sampled from different individuals in the simulans population.

mel   TGAGTTTTTGCCACGAGATAGCAAAGTGGTCATTATTCTT
sim1  TGAGTTTTTGCTACGAGATGGCAATGTGGTCATTATTCTT
sim2  TGAGTTTTTGCCACGAGATGGCAATGTGGTCATTATTCTT
sim3  TGAGTTTTTGCCACGAGATGGCAATGTGGTCATTATTCTT
sim4  TGAGTTTTTGCCACGAGATGGCAAAGTGGTCATTATTCTT
sim5  TGAGTTTTTGCCACGAGATGGCAAAGTGGTCATTATTCTT
sim6  TGAGTTTTTGCCACGAGATGGCAAAGTGGTCATTATTCTT

a) Assume that the left end of the sequences is the 5' end. Could these sequences code for part of a protein?
b) How many segregating sites are there in the D. simulans sample?
c) Assume that the effective population size for D. simulans is 10^6. Estimate the mutation rate based on the simulans sample.
d) How many sites diverge between mel and sim1?
e) Assume that the mutation rate you calculated in (c) has remained constant since the two species split. Assume further that these species reproduce ten times per year. Estimate how long ago the species split, using: i) the raw divergence between mel and sim1, ii) the Jukes-Cantor correction, and iii) the Kimura correction.
f) How do your answers change if you repeat the calculations in (e), but between mel and sim2? What does this mean?
g) According to the best evidence available, D. simulans and D. melanogaster diverged around 3 million years ago. Is this consistent with your estimates for the divergence time?

8) Polymorphism or Divergence?
The following terms are associated with polymorphism, divergence, or both. In each case, argue for one of the three options. (Example: The term 'between-species' is probably more strongly associated with divergence than polymorphism. After all, divergence measures the number of changes in a locus between species.)

between-species
segregating site
fixation
nucleotide diversity
natural selection
random genetic drift
within-population
heterozygosity
mutation
genetic diversity
substitution
effective population size

9) Write a computer simulation of genetic drift. Start with a population of N asexual types, all different. Produce a new population by random sampling, and continue until all are identical. Use the program to check the statement that the expected time to fixation of one of the types is 2N generations, with a standard deviation a little greater than N. Then extend your simulation to include diploid types, and check the statement again.

10) Middle-Earth has two distinct populations of elves: those of Lothlorien, and those of Mirkwood. The woods of Lothlorien are much better suited to elf habitation than those of Mirkwood, so there are twice as many elves in the former as in the latter. These communities have been isolated from each other for a long enough time that allele frequencies at many loci have changed, but they are still fully capable of interbreeding, and therefore must be considered a single species. Each population has only 2 alleles at the disposition locus (congenial or arrogant), but the frequencies of the alleles are not the same in the two populations. Each population is initially in Hardy-Weinberg equilibrium with itself. Now that the Adversary has been defeated, the communities merge on their way back to the Elf Havens across the Sea.

a) Calculate the change in the proportion of homozygotes between the P1 generation where the populations merge, and the F1 generation, assuming that mating is now random within the entire Elven community.
b) Does the sign of the change in the proportion of homozygotes depend on whether the larger or smaller population initially had a higher incidence of arrogance? Explain your answer.

11) The peppered moth Biston betularia can be one of two colors, white or dark brown. A single locus with two alleles is responsible for determining the body color phenotype. Allele ‘M’ is dominant to ‘m’, and its presence leads to a greater production of melanin that darkens the moth’s body color. An extremely large population of the peppered moth has thrived in a forest of dark brown and white barked trees for many centuries. 55% of the moths in this population are white in color. Both white and dark brown moths can survive in the forest because they are camouflaged against trees that match their body color. When a fire sweeps through the forest, the benefits enjoyed by the white moths are eliminated. While many of the moths and birds in the forest survive the blaze, the dark colored moths are now better camouflaged against the dark burnt bark of the trees. The white moths are so poorly camouflaged that they are immediately eaten by predatory birds. None of the white moths are ever able to reproduce.

NOTE: Assume Hardy-Weinberg assumptions for the following questions, including an infinite population size.

a) What are P (MM), Q (Mm), R (mm), p (frequency of M) and q (frequency of m) for the population of moths before the forest fire?
b) Assuming both colors of moths survive the forest fire with equal probability, what are P, Q, R, p and q for the first generation of moths after the fire (at the time when they are born)?
c) If white moths are destroyed generation after generation, find an equation for q_t in terms of q_0.
d) After how many generations will q = .05? How long will it take until q is exactly 0?
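For checking a closed-form answer to Problem 11(c) and (d) numerically, the generation-by-generation bookkeeping can be sketched as below. This is a minimal sketch, assuming random mating, an effectively infinite population, and complete removal of mm individuals before reproduction. The function names and the starting frequency of 0.5 are illustrative only and are not taken from the problem.

```python
def next_q(q):
    """Frequency of m in newborns after one generation in which
    no mm individual reproduces (assumed: random mating, no drift)."""
    p = 1.0 - q
    # Hardy-Weinberg genotype frequencies at birth
    freq_MM, freq_Mm, freq_mm = p * p, 2.0 * p * q, q * q
    # mm individuals are removed; only MM and Mm become parents
    survivors = freq_MM + freq_Mm
    # Among surviving parents, half the alleles of each heterozygote are m
    return 0.5 * freq_Mm / survivors


def generations_until(q0, target):
    """Count generations until the frequency of m is at or below target."""
    q, t = q0, 0
    while q > target:
        q = next_q(q)
        t += 1
    return t, q


if __name__ == "__main__":
    # Illustrative starting frequency, not the value from the problem
    t, q = generations_until(0.5, 0.11)
    print(t, q)
```

Comparing the iterated values with your formula for q_t at a few values of t is a quick way to catch algebra slips.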
12) A great catastrophe befalls Whoville; when Horton falls asleep, some of the other residents of the Jungle of Nool (claiming that they are acting in the best interests of the poor deranged elephant) boil the dust speck that contains the Whos' world. Only 1 male and 1 female Who survive the horrible calamity. Horton recovers the dust speck, and with the aid of caffeine and increased vigilance, he protects the tiny dust speck long enough for the survivors to produce a (F1) generation of 10 Whos. The ability of these Whos to produce substantial noise is controlled by a simple 2-allele locus, and before the catastrophe the incidence of the loud allele was .25, while the incidence of the quiet allele was .75. Assuming that the Whos were in Hardy-Weinberg equilibrium before the catastrophe,

a) Use a Wright-Fisher model to predict the probability of the quiet allele being extinct in the F1 generation.
b) Use a Wright-Fisher model to predict the probability of the loud allele being extinct in the F1 generation.
c) Identify the shortcomings of the Wright-Fisher model in this example (i.e., what might actually happen in reality that wouldn't be indicated in the Wright-Fisher model). Give an example in which these shortcomings could substantially change the probabilities calculated in parts a and b.

13) Scientists discover a very basic form of life on Mars. The genetic system is based on only four amino acids (X, Y, Z, and W) and two nucleotides (A and B). Codons in this new system are only two nucleotides long and code for the amino acids according to the following table.

Nucleotides   AA   AB   BA   BB
Amino acid    X    Y    Z    W

Assuming that a nucleotide changes to any other in a given time step t with probability b, answer the following questions about the substitutions in a particular “neutral” amino acid sequence.

HINTS:
• for simplicity assume that only a single substitution can occur in a given time step.
• compare the problem to Kimura’s 2-parameter correction

a) What is the equation K(t) for the expected number of amino acid changes that actually do occur after time t?
b) Find a correction for the number of changes that have occurred (K) in terms of the number of observed amino acid substitutions. (Hint: there are two types of substitutions. Include both in the formula for K.)

14) Elephant seals possess an interesting system of mating. One alpha-male lies in the center of a harem of females and attempts to mate with as many of these females as possible throughout the course of the mating season. The harem is also surrounded by 5 beta-males, usually younger and smaller, who lie around the harem and protect it from invasion by other males. When the alpha-male starts to mate, the beta-males use his distraction to do some mating of their own with the females at the outer edges of the harem. Other males, not alphas or betas, stay in the water and are not allowed to mate at all. One would think that the alpha-male sires the most offspring in this situation; however, recent genetic studies have found that beta-males as a group actually have at least as much mating success as the alpha-male. From a typical harem, 50% of the children are sired by beta-males and 50% are sired by the alpha-male. Given that there are 40 alpha-males and 2000 females in the population, assuming that each of the harems is exactly the same size, assuming that all beta-males have exactly the same probability of having an offspring, and assuming that all females are fertile and available for mating, calculate the effective population size for this population of elephant seals.

NOTE: assume non-overlapping generations for simplicity.
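As a starting point for Problem 14, the following sketch computes an inbreeding effective size when parents within a sex contribute unequally. It assumes a Wright-Fisher style model in which every offspring draws its father and its mother independently, with probabilities equal to each parent's expected share of matings. Under that assumption two gene copies coalesce in the previous generation with probability (1/8)(sum of squared male shares + sum of squared female shares), which reduces to the familiar Ne = 4 Nm Nf / (Nm + Nf) when shares are equal within each sex. The function name and the toy numbers are hypothetical and are not the seal data.

```python
def effective_size_from_shares(male_shares, female_shares):
    """Inbreeding effective size given each parent's expected share of
    offspring (assumption: offspring pick parents independently)."""
    # Normalize so shares within each sex sum to one
    total_m = float(sum(male_shares))
    total_f = float(sum(female_shares))
    m = [x / total_m for x in male_shares]
    f = [x / total_f for x in female_shares]
    # Probability that two offspring share a father (or a mother)
    same_father = sum(x * x for x in m)
    same_mother = sum(x * x for x in f)
    # Setting 1/(2 Ne) = (1/8)(same_father + same_mother) and solving for Ne
    return 4.0 / (same_father + same_mother)


if __name__ == "__main__":
    # Toy check with equal shares: 10 males and 90 females
    print(effective_size_from_shares([1] * 10, [1] * 90))
    # Toy example with one dominant male taking half of all matings
    print(effective_size_from_shares([5] + [1] * 5, [1] * 90))
```

The equal-share case is a useful sanity check before plugging in unequal male shares.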
15) You find out that your favorite protein is translated from the following mRNA:

5’ CUAUGGCAACAUCAUCAGCGGCA 3’

a) Write down the amino acid sequence of the protein if translation starts at the first Met codon encountered in the mRNA sequence.
b) Now assume that translation starts at the first Tyr codon. Write down the amino acid sequence of the new protein.
c) Which protein should experience more amino acid changes per unit time if all cytosines in your organism are methylated? Explain.

16) Imagine that you have cloned two homologous genes in two species of Drosophila. You sequence and align them. Below are short parts of the alignments of the expected mRNA sequences. The sequences are parsed into expected codons (for instance, the first codon in the Drosophila simulans sequence of Gene 1 is AUC).

Gene 1:
Drosophila sim: AUC-ACC-CAC-CAA-CAG-UUC-UGU-GCU
Drosophila mel: AUG-ACA-CAC-CAA-CGG-UUC-UGC-GAU

Gene 2:
Drosophila sim: ACA-GAU-GGU-CCU-CGC-GUG
Drosophila mel: ACA-CUU-AGU-AUU-CAC-GCA

a) Above are coding sequences from two genes in two closely related species, D. simulans and D. melanogaster. Assuming that the path with the fewest nonsynonymous substitutions represents the true path between the two codons, calculate Ka and Ks for both genes.
b) Would you be surprised to learn that both genes have no function at the level of the protein? Explain.
c) Would you be surprised to learn that Gene 2 has been under strong selection to change its protein sequence? Explain.

17) You are given two sequences (50 bp each) of the homologous pseudogene in two species of yeast. The alignment is shown below:

CCTCGACGGCTTAGATCTGATCTGACCTAATGCTGCAATCGTTACAAAGT
CCTCCACGAGTAAGAGTTGATCCGACTTAGTCCTGCGATCGTTAGATAAT

You know that these species last shared a common ancestor 10 MYA and that both species go through 50 generations a year.

a) Using the Jukes-Cantor model of nucleotide substitution, estimate the mutation rate per nucleotide per year in these two species of yeast.
Assume that the mutation rate is the same in both species.
b) Do you believe that the Jukes-Cantor model is appropriate in this case? Do you see any evidence that you need to use the Kimura 2-parameter model instead? Explain.
c) You sample 10 alleles of this pseudogene, different by origin, in the population of one of these species of yeast. You sequence the same 50 bp region in all 10 cases and find that there are 5 distinct alleles in the following proportions:

Allele   1   2   3   4   5
Count    6   1   1   1   1

Estimate the effective population size in this species of yeast.

18) In Drosophila the rates of all transitions and transversions are very similar to each other except for the transition from C to T (or equivalently from G to A). You know that the average GC content of neutral, entirely unconstrained DNA in Drosophila is 34%. Estimate the relative probability of a C to T transition versus a T to C transition.

19) Your friend just sequenced the full genome of a new species of bacteria. The GC content in this species is 50%. Your friend decides to calculate the proportion of different kinds of nucleotide pairs (dinucleotides) and finds among other things that C’s are followed by A’s 40% of the time, while 60% of the time they are followed by the other three nucleotides. This reminds you of a similar pattern in the human genome. From what you know about methylation-dependent deamination, advise your friend which other dinucleotide frequencies he needs to look at and give qualitative predictions of what he should see.

20) A molecular biologist discovers that human cells that express a particular allele of a particular membrane protein do not get infected by the HIV virus. Excitedly he measures the frequency of this allele in the human population and discovers that it is present in approximately 10% of the population. He finds this result surprising because only ~1% of the human population shows natural immunity to HIV and not 10%.
Knowing that you are taking a class in molecular evolution, he asks you for help. What would you tell your friend? Is his result consistent with the observations of natural immunity to HIV? Tell us in a few sentences what you can surmise from this information.

21) A neutral region of a Drosophila genome is 60% AT rich (60% of the nucleotides are AT pairs and 40% are GC pairs). Assuming that mutational pressure is solely responsible for maintenance of this AT content, and that mutation operates the same way on both DNA strands, fill in the missing parts of the mutation frequency table.

From To A T G C A x T G x 0.15 x C 0.06 0.08 0.06 x

22) In the fly Drosophila mauritiana, a transposable element called mariner exists at several loci in the genome and segregates neutrally. Each copy is deleted from the genome with a frequency of .5 percent per generation. In a population in which mariner is fixed at a specific locus, how many generations will it take until the frequency of individuals that are homozygous for the deletion is 5 percent?

23) Imagine that you discover life on Mars and find that Martian DNA contains 5 different nucleotides.

a) Assuming that amino acids are still coded with triplets on Mars and that the Mars universal genetic code contains 3 stop codons, what is the maximum number of amino acids potentially encoded by Martian genes?
b) Derive the Jukes-Cantor correction applicable on Mars.

24) You study two anonymous regions in the maize and rice genomes. You find that in the first region (region A) 25% of nucleotide positions are different in the two species and in the other one (region B) only 5% are different.

a) This result is entirely consistent with the neutral theory. Explain why. What would Kimura infer about the biological difference between the two regions?
b) You proceed to study nucleotide polymorphism in these two regions. In a sample of 10 maize alleles from locus A (2000 bp in length) you find 20 segregating sites.
On the assumptions of the neutral theory, how many segregating sites do you expect to observe in a sample of 5 alleles of the same length (2000 bp) in region B?

25) Use the coalescence approach to find the expected time to the common ancestor of a very large sample (size n) in a much larger diploid population of size N. The following fact should prove helpful: the sum of 1/(i(i-1)) over i = 2, ..., n approaches 1 as n becomes very large.

26) You sample 5 DNA sequence alleles from a population and discover 10 segregating sites. On the assumptions of the neutral theory, how many more segregating sites do you think you will observe if you sample 5 more alleles?

27) The population of unicorns has an effective size that is 10 times greater than the effective size of the leprechaun population. The generation time of leprechauns is 10 times longer than the generation time of unicorns. You observe that there are 500 nucleotide differences between the sequences of a particular gene in leprechauns and unicorns. Assuming that the neutral theory is correct, and that the mutation rate per generation is the same in both unicorns and leprechauns, determine:

a) How many mutations were fixed in this gene since the common ancestor of leprechauns and unicorns?
b) How many mutations in this gene were fixed in the leprechaun population since the common ancestor? How many were fixed in the unicorn population?
c) What is the expected ratio of heterozygosities in the unicorn and leprechaun populations?

28) You manage a population of an endangered species that is in danger of inbreeding depression (loss of genetic diversity).
You monitor the population for five years and find that the sex ratio and overall population size fluctuate as shown below:

Year   Males/Female   Population Size
1      .6             200
2      .7             400
3      .5             100
4      .8             600
5      .6             200

A survey of genetic diversity at two loci reveals the following genotype frequencies:

Locus 1
Genotype   A1A1   A1A2   A2A2   A1A3   A2A3   A3A3
Freq       .01    .1     .25    .08    .40    .16

Locus 2
Genotype   B1B1   B1B2   B2B2
Freq       .04    .72    .24

a) Do any of these loci appear to be under selection? (Explain how you can tell.)
b) Assuming your observation of the population's demographics over the last five years is representative of its future, how long will it take for the genetic diversity at these loci to decay to 90% of their current levels?
c) If you wanted to slow this process, would you be better served by stabilizing the population size at 500 (with sex ratio continuing to fluctuate), or stabilizing the sex ratio at 1?

29) You sequence part of a pseudogene in five randomly sampled chromosomes. Each column below is one chromosome, shown in 10 bp blocks read from top to bottom.

Chromosomes 1-3:
AAGCTGGACT AAGCTGGACT AAGCTAGACT
GAAGCTAAGC GAAGCTAAGC GAAGCTAAGC
TATTACGACG TATTGCGACG TATTGCGACG
GCCATTACGA GCCATTACGA GCCATTGCGA
AAGCTCCGTT AAGCTCCGTA AAGCTCCGTA

Chromosomes 4-5:
AAGCTAGACC AAGCTAGACC
GAAGCTAAGC GAAGCTAAGC
TATTGCGACG TATTGCGACG
ACCATTGCGA ACCATTGCGA
AAGCTCCGTT AAGCTCCGTT

a) You know from other studies of this species that the mutation rate is approximately 10^-8 mutations/(site*generation). How large do you think the population is?

b)* Say your sample looked like this:

AAGCTGGACT AAGCTGGACT AAGCTAGACT AAGCTGGACT AAGCTGGACC
GAAGCTAAGC GAAGCTAAGC GAAGCTAAGC GAAGCTAAGC GAAGCTAAGC
TATTACGACG TATTGCGACG TATTGCGACG TATTGCGACG TATTGCGACG
GCCATTACGA GCCATTGCGA GCCATTGCGA ACCATTGCGA GCCATTGCGA
AAGCTCCGTT AAGCTCCGTA AAGCTCCGTT AAGCTCCGTT AAGCTCCGTT

What would be the topology (shape) of this sample's coalescent tree? What kind of population history would result in this pattern? (Hint: Think about what determines the probability of a coalescent event at any given time in the population's history.)
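For sequence samples like those in Problem 29, the usual summary statistics can be tallied with a few short helpers. The sketch below counts segregating sites, then computes Watterson's estimator and the mean pairwise difference, both as per-locus totals. It assumes equal-length, gap-free aligned sequences; the function names and the four toy sequences are illustrative and are not the problem's data. Under the standard neutral model for a diploid population, dividing either estimate by the sequence length gives a per-site value that estimates 4 Ne μ.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Number of alignment columns with more than one nucleotide."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def watterson_theta(seqs):
    """Watterson's estimator: S divided by a_n = sum_{i=1}^{n-1} 1/i."""
    n = len(seqs)
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / a_n


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    # Toy alignment for illustration only
    toy = ["AAGT", "AAGT", "ACGT", "ACGA"]
    print(segregating_sites(toy))
    print(watterson_theta(toy))
    print(mean_pairwise_differences(toy))
```

Comparing the two estimates on the same sample is one way to see how tree shape affects the data, which is relevant to part (b).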
30)* You discover a peculiar mating system in a plant species where 30% of the individuals only self-fertilize and 70% of the individuals only cross-fertilize. You also observe that both mating types produce equal numbers of offspring plants and that a particular plant’s phenotype is completely independent of its parent’s phenotype. Given this information,

a) Determine the effective population size for a population of 10,000 plants.
b) Can you find frequencies of self-fertilizers and cross-fertilizers that would make Ne = 10,000? (That is, change the 30% and 70%.)
c) Instead of the mating system described above, assume all individuals self to make 30% of their progeny and cross to make the other 70%. What would Ne be in this case?

31) Deleterious mutations can be maintained in a population even as selection purges them from the population if they recur with some frequency. Equation #3.9 in Gillespie shows the equilibrium frequency of such a deleterious allele given incomplete dominance. Derive the equilibrium frequency of a deleterious allele that is completely recessive (i.e. h = 0). (Hint: refer to Gillespie’s method on p. 70 but calculate Δq instead.)
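A derived equilibrium for Problem 31 can be checked by iterating the selection and mutation steps until the allele frequency stops changing. The sketch below is one way to do that, under assumptions that are not spelled out in the problem: genotype fitnesses of 1, 1, and 1 - s with the deleterious allele fully recessive, one-way mutation to the deleterious allele at rate u per generation, random mating, and no drift. The symbols u and s and the numerical values are illustrative choices.

```python
def step(q, u, s):
    """One generation: selection against the recessive homozygote,
    then one-way mutation toward the deleterious allele."""
    # Allele frequency after selection with fitnesses 1, 1, 1 - s
    mean_fitness = 1.0 - s * q * q
    q_sel = q * (1.0 - s * q) / mean_fitness
    # A fraction u of the remaining wild-type alleles mutate
    return q_sel + u * (1.0 - q_sel)


def equilibrium(u, s, q0=0.0, generations=100000):
    """Iterate the recursion for a fixed number of generations."""
    q = q0
    for _ in range(generations):
        q = step(q, u, s)
    return q


if __name__ == "__main__":
    # Illustrative parameter values, not taken from the problem
    print(equilibrium(u=1e-6, s=0.1))
```

Trying a few values of u and s and comparing the output with your expression is a reasonable check on the derivation.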