CSE280A Class Projects

1 Complex regions architecture

The KIR region has a very complex architecture, as the diversity of gene regions helps in the immediate immune response.

1.1 Steps:

1. Visit the IPD-KIR database (http://www.ebi.ac.uk/ipd/kir/) and read the introduction of the attached manuscript to get an overview of the region.
2. Use http://www.ebi.ac.uk/ipd/kir/sequenced_haplotypes.html to download the complete known sequences of KIR in individuals. Call this set S.
3. Download candidate amino-acid sequences for all of the genes that may be found in this region. Call this set G.
4. Use BLAST or a similar tool to decide the presence or absence of each gene in G in each of the sequences in S. This may be difficult because the genes are repetitive, and you must use appropriate algorithms to make the correct decision.
5. Write a script to compute and display a dot-plot for any pair of sequences in S. See Figure 1(b) of the attached manuscript for an example. Along with the dot-plot, display gene locations as vertical and horizontal projections of intervals. Provide an optional masking feature to mask out repetitive sequence.
6. Use a short-read fragment simulator to generate paired-end reads from the candidate sequence. Design and implement an algorithm that uses the reads from some s ∈ S to decide the presence or absence of each gene g ∈ G. Note that this may or may not be easy, depending on the repetitive nature of the genes. Show results that compare your tool to the known structure of the genome.
7. Repeat the previous experiment for pairs of sequences from S.

2 Multi-allelic and polygenic signatures of selection

The goal of this project is to understand selection signatures in multi-allelic (soft-sweep) and polygenic selection. Start by building a forward simulator that can simulate these kinds of selection.

1. Build a standard forward simulator for a haploid population as follows: assume a Wright-Fisher model with N haplotypes from generation to generation. Haplotypes containing a beneficial allele are selected with probability ∝ 1 + s, whereas other haplotypes are selected with probability ∝ 1. Each individual is mutated at m sites relative to its parent, where m is drawn from a Poisson distribution with parameter µ. Assume that there is no recombination.
2. Start with all haplotypes being all 0, and run the simulator without selection for about 2N generations to obtain a well-mixed haploid population. Then choose a low-copy-number allele as beneficial and apply selective pressure. This should result in a hard sweep. Show results on the following after running a very large number of simulations (100,000):
   (a) Time to fixation of the haplotype containing the beneficial allele as a function of N and s. Try plotting the time to fixation (in generations) as a function of $\frac{1}{s}\ln(Ns)$.
   (b) Plots of the scaled site-frequency spectrum, and of the distribution of haplotype frequencies of common haplotypes in the region, as a function of time in generations.
3. Next, generate a soft sweep as follows. Use the same model as above, but introduce the selective constraint when the beneficial allele already exists in multiple haplotypes. Each of these haplotypes will now be selected with probability ∝ 1 + s.
4. Compute the scaled site-frequency spectra (SSFS) as well as the distribution of haplotype frequencies, and contrast your results with the hard-sweep case.
5. To mimic polygenic selection, simulate two regions simultaneously. Assume that the regions are unlinked (very far apart, or on separate chromosomes), so that an individual chooses a parent independently in each region. Choose an allele in each of the two regions as being beneficial. Simulate as before, except that under selection, individuals containing one or more beneficial alleles are as fit as individuals containing two beneficial alleles. Thus, there is no guarantee that the beneficial allele is driven to fixation. Compute the impact of this on the SSFS, etc., in each region.
6. Design and implement new statistics for identifying signatures of polygenic and/or multi-allelic selection.
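As a possible starting point for the forward simulator in Project 2 (steps 1 to 4), here is a minimal Python sketch. It is not a reference solution. It assumes an infinite-sites representation in which each haplotype is the set of site ids carrying the derived allele (1), so every new mutation gets a fresh id. It also assumes that "scaled" SFS means i times the number of sites at derived count i, with fixed sites excluded. The function names and the `min_copies` parameter are hypothetical, chosen only for illustration.

```python
import itertools
import math
import random
from collections import Counter


def poisson(rng, lam):
    """Draw from Poisson(lam) using Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def next_generation(pop, s, beneficial, mu, rng, new_site):
    """One Wright-Fisher generation: haploid, fixed size N, no recombination.

    pop        : list of frozensets, each holding the derived sites of one haplotype
    beneficial : site id under selection, or None for a neutral generation
    new_site   : zero-argument callable returning a fresh site id (infinite sites)
    """
    # Carriers of the beneficial allele are chosen as parents with weight 1 + s.
    weights = [1.0 + s if beneficial is not None and beneficial in h else 1.0
               for h in pop]
    parents = rng.choices(pop, weights=weights, k=len(pop))
    children = []
    for parent in parents:
        m = poisson(rng, mu)  # number of new mutations in this child
        children.append(parent | frozenset(new_site() for _ in range(m)))
    return children


def site_frequency_spectrum(pop):
    """Map derived-allele count i to the number of sites with that count."""
    counts = Counter(site for h in pop for site in h)
    return Counter(counts.values())


def scaled_sfs(pop):
    """Assumed scaling: i * (number of sites at count i), segregating sites only."""
    n = len(pop)
    sfs = site_frequency_spectrum(pop)
    return {i: i * sfs[i] for i in sorted(sfs) if i < n}


def simulate_sweep(N, s, mu, rng, min_copies=1, max_gens=100_000):
    """Burn in neutrally for 2N generations, then select on a rare allele.

    min_copies=1 picks the rarest segregating allele (hard-sweep setup);
    a larger value requires the allele to sit on several haplotypes already
    (one way to set up a soft sweep). Returns (generations_to_fixation, pop),
    or None if no candidate allele exists or the allele is lost.
    """
    ids = itertools.count()
    new_site = lambda: next(ids)
    pop = [frozenset()] * N  # all haplotypes start as all-0
    for _ in range(2 * N):
        pop = next_generation(pop, 0.0, None, mu, rng, new_site)
    counts = Counter(site for h in pop for site in h)
    candidates = {site: c for site, c in counts.items() if min_copies <= c < N}
    if not candidates:
        return None
    beneficial = min(candidates, key=candidates.get)
    for gen in range(1, max_gens + 1):
        pop = next_generation(pop, s, beneficial, mu, rng, new_site)
        carriers = sum(beneficial in h for h in pop)
        if carriers == N:
            return gen, pop
        if carriers == 0:
            return None
    return None
```

Calling `simulate_sweep(N=100, s=0.1, mu=0.5, rng=random.Random(0))` repeatedly with different seeds and keeping the non-`None` results gives fixation times that can be compared against N and s. Most runs that start from a single copy will lose the allele to drift, so many replicates are needed. For the 100,000 replicates requested above, a vectorized or compiled implementation would likely be necessary.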