CSE291: Personal genomics for bioinformaticians

Class meetings: TR 3:30-4:50 MCGIL 2315
Office hours: M 3:00-5:00, W 4:00-5:00 CSE 4216
Contact: [email protected]

Today’s schedule:
• 3:30-4:15 Intro to population genetics
• 4:15-4:20 Break
• 4:20-4:50 Overview of PS2, Journal discussion

Announcements:
• PS1 due tomorrow. Any issues with XSEDE?
• PS2 is out. Due Jan. 26.
• Readings for Lecture #3 posted (short)

Intro to population genetics
CSE291: Personal Genomics for Bioinformaticians
01/12/17

Outline
• Patterns of genetic variation
• Linkage disequilibrium
• Discuss problem set 2
• Journal club: Positive selection

Patterns of genetic variation

Four forces driving evolution
• Mutation: introduces genetic variation
• Genetic drift: evolution is inherently a stochastic process
• Natural selection: advantageous mutations are more likely to spread
• Gene flow: exchange of genetic material between populations
“Nothing in biology makes sense except in the light of evolution” – Theodosius Dobzhansky (1973)

You share genes with me
[Figure: genetic differences between pairs of genomes; populations shown: Africa, West Eurasia, South Asia, America, East Asia/Oceania. Mallick, et al. 2016]
• Identical twins: 0 differences
• Unrelated humans: 1/1,000
• Human vs. chimp: 1/100

Mutation: the fuel for evolution
TCATTTGAGCATTAAATGTCAAGTTCTGCAC
TCATTTGAGCATTAAGTGTCAAGTTCTGCAC
rs12913832 (e.g. eye color mutation)
• Human mutation rate: ~$1.5 \times 10^{-8}$ mutations/bp/generation
• ~50 new mutations per generation
• Classes of mutations:
  • Single nucleotide polymorphisms (SNPs)
  • Insertions/deletions (indels)
  • Structural variants (SVs), copy number variants (CNVs)
Roach et al. 2010; Conrad et al. 2011

Mutation: a not-so-random process
• Per-generation mutation rate is highly dependent on paternal age
• Local sequence features influence local mutation rates across the genome. Most important: trinucleotide context (i.e. the bases right before and right after a given position)
Michaelson, et al. 2012

The language of genetic variation
[Figure: a chromosome with the example sequence TTAAATGTC; alleles A, G; possible genotypes AA, AG, GG]
• Diploid: contains two complete sets of chromosomes, one from each parent
• Haploid: contains one complete set of chromosomes
• Locus: a position in the genome, e.g. chr1:1000
• Allele: one of two or more alternative versions of a specific locus due to mutation
• Genotype: specifies the allele on each chromosome copy

More terminology about SNPs
Example alleles observed at one position: C C C C A C C A
• Reference allele: the nucleotide at this position in the reference genome
• Alternate allele: an allele in the population that doesn’t match the human reference
• Major allele: the most common allele at a given position. In this example, C. Note that the major allele is not always the reference allele.
• Minor allele: any allele besides the major allele. In this example, A.
• Minor allele frequency (MAF): frequency of the minor allele. Here MAF = 2/8 = 0.25.
• rsid: a unique identifier for a given SNP, as used by dbSNP

Genetic drift
Each generation: choose n alleles with replacement
Allele frequencies by generation:
Gen. 1: p_red=0.50; p_blue=0.50
Gen. 2: p_red=0.35; p_blue=0.65
Gen. 3: p_red=0.40; p_blue=0.60
Gen. 4: p_red=0.30; p_blue=0.70
Gen. 5: p_red=0.20; p_blue=0.80
…
Fixation: p_red=0.00; p_blue=1.00

Wright-Fisher model
• N individuals, 2N alleles in a diploid population
• 2 alleles: A, B
• p: frequency of allele A
• q: frequency of allele B
Assumptions:
• Non-overlapping generations
• Constant population size
• New generation drawn at random
Probability that the next generation has k copies of allele A (taken directly from the binomial distribution):
$$P(k) = \binom{2N}{k} p^k q^{2N-k}$$
• In a finite population, what eventually happens to allele frequencies?
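One way to explore the question above is to simulate the Wright-Fisher sampling step directly. The following is a minimal sketch rather than course-provided code: the function name `wright_fisher`, its parameters, and the use of Python's standard `random` module are illustrative assumptions. Each generation draws 2N alleles with replacement, so the number of A copies is a binomial draw with success probability p.

```python
import random


def wright_fisher(n_individuals, p0, max_generations=10_000, seed=0):
    """Track the frequency of allele A under pure drift (no selection, no mutation).

    Hypothetical helper for illustration. Returns the list of frequencies,
    one per generation, stopping once the allele is lost (0.0) or fixed (1.0).
    """
    rng = random.Random(seed)
    n_alleles = 2 * n_individuals  # diploid population: 2N allele copies
    p = p0
    trajectory = [p]
    for _ in range(max_generations):
        if p == 0.0 or p == 1.0:
            break  # absorbing states: loss or fixation
        # Binomial(2N, p) draw: each new allele copy is A with probability p.
        k = sum(rng.random() < p for _ in range(n_alleles))
        p = k / n_alleles
        trajectory.append(p)
    return trajectory


# Example: 50 diploid individuals, allele A starting at frequency 0.5.
trajectory = wright_fisher(n_individuals=50, p0=0.5)
print("generations simulated:", len(trajectory) - 1)
print("final frequency of A:", trajectory[-1])
```

Rerunning with different `seed` values or a different `n_individuals` shows how the trajectories vary between runs and with population size.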
Fixation or loss
• How is the time to fixation related to population size? It grows with population size.
  Time to fixation: $T_{fixed} = -\frac{4N(1-p)\ln(1-p)}{p}$, which is ≈ 4N generations in a large population for a new mutation with p = 1/2N
• What is the probability of fixation of an allele due to drift? Its initial frequency p

Example: Effect of bottlenecks
Ancestral population: p_red=7.4%, p_green=7.4%, p_blue=85.2%
Bottleneck
Founder population: p_red=25%, p_green=0%, p_blue=75%
Bottlenecks have dramatic effects on frequencies of rare alleles

Example bottleneck populations:
• Early American settlers
  • Several thousand French Canadian settlers. High rate of Leigh Syndrome, Tyrosinemia Type I
• Ashkenazi Jewish
  • Descend from several hundred people. High rate of e.g. Gaucher, Tay-Sachs, CF
• Finland
  • Bottleneck 4000 yrs ago. 36 “Finnish heritage diseases” e.g. Usher Syndrome

Effective population size
• In reality, population size is not constant!
• “Effective population size” is the size an idealized population would need in order to behave like the real population
• Larger effective population size -> more genetic diversity
• Recent population explosion -> tons of new rare alleles
https://en.wikipedia.org/wiki/World_population#/media/File:Population_curve.svg

Effective population size – differs by ancestry
McEvoy et al. 2011

Hardy Weinberg Equilibrium
Hardy-Weinberg principle: allele and genotype frequencies remain constant in the population over time in the absence of other evolutionary forces

Alleles: A, B with f(A) = p and f(B) = q = 1-p

Genotypes under HWE:
                Male A (p)    Male B (q)
Female A (p)    AA ($p^2$)    AB ($pq$)
Female B (q)    AB ($qp$)     BB ($q^2$)

$f(AA) = p^2$
$f(AB) = 2pq$
$f(BB) = q^2$
$f(A) = f(AA) + 0.5 f(AB) = p^2 + pq = p(p+q) = p$
$f(B) = f(BB) + 0.5 f(AB) = q^2 + pq = q(q+p) = q$

Assumptions:
• Diploid
• Sexual reproduction
• Random mating
• Large population size
• No selection, mutation, migration
Useful as a quality metric for SNP calling

Natural selection
https://9sc4evolution.wikispaces.com/4.+Natural+Selection

Selection coefficient s: relative fitness of red vs. blue
$$P(\text{red sampled}) = \frac{p_{red}(1+s)}{p_{red}(1+s) + p_{blue}}$$
• s<0: red is deleterious
• s>0: red is beneficial
• s=0: standard Wright-Fisher, no selection
e.g. with s>>0, fixation happens quickly:
Gen. 1: p_red=0.50; p_blue=0.50
Gen. 2: p_red=0.60; p_blue=0.40
Gen. 3: p_red=0.80; p_blue=0.20
…
Fixation: p_red=1.00; p_blue=0.00

Patterns of selection
[Figure: three panels plotting fitness against genotype (AA, AB, BB)]
• Directional. Example: lactose tolerance
• Balancing. Example: sickle-cell anemia
• Disruptive. Example: ? butterfly colors

Example directional selection: lactose intolerance
• Most humans can’t digest lactose past childhood.
• A SNP in the gene LCT (C/T-13,910) is associated with the ability to digest milk in Europeans (autosomal dominant)
• Allele frequency decreases North to South in Europe, thought to be associated with strong selective advantage in dairy farmers
• An independent mutation in the same gene in African populations is also associated with lactase persistence.
On 23andMe: rs4988235 T/A near LCT gene. (I am AA = tolerate lactose)

Example balancing selection: sickle cell anemia
Observation (Haldane): red blood cell disorders (e.g. sickle-cell anemia, thalassemias) are common in areas where malaria is endemic.
Sickle cell anemia: due to mutations in beta hemoglobin (HBB); patients experience infections, pain, and fatigue
• Homozygous wildtype: normal
• Heterozygous: “sickle cell trait”, resistant to malaria!
• Homozygous mutant allele: affected by sickle cell anemia
https://prezi.com/5yynhkp0ffnj/sickle-cell-anemia/
On 23andMe: i3003137 (rs334) T/A in gene HBB. (I am TT = homozygous wildtype)

More examples: skin color
[Figure: world distribution of A111T polymorphism in SLC24A5]
• A111T mutation in SLC24A5 associated with lighter skin pigmentation in Europeans
• Became predominant in Europe ~10-20K years ago
• Hypothesis: selection for derived allele based on need for sunlight to produce vitamin D
• One of the strongest signals of positive selection in Europeans
On 23andMe: rs1426654 A/G in SLC24A5. (I am AA = light skin)
Canfield et al. 2014; Sabeti et al. 2007

Allele frequency vs. fitness effect
[Figure: effect size (mild to severe) plotted against allele frequency (rare to common)]
• Rare, severe: severe Mendelian disorders, e.g. Cystic Fibrosis, Tay-Sachs
• Common, severe: nonexistent (removed by selection)
• Rare, mild: likely many examples, but low power to detect these
• Common, mild: e.g. high cholesterol, Crohn’s Disease, Type II Diabetes (many common alleles with small effect sizes)

Admixture in modern and ancient humans
Ewen Callaway, Nature News 201

Neanderthal admixture and diabetes
Risk variants inherited from Neanderthal admixture!

Linkage disequilibrium

Recombination
https://www.reddit.com/r/askscience/comments/3hq4zl/does_crossover_occur_in_all_4_nonsister/

Example history of two neighboring alleles
Present-day polymorphisms are the result of ancient mutation events:
[Figure: an ancestral haplotype carrying alleles A and C; new haplotypes appear after each of two successive mutations, which introduce the alleles T and G]
Haplotype: combination of variants on the same copy of a chromosome

Recombination rearranges ancestral alleles
[Figure: the same haplotypes before and after a recombination event]

Linkage disequilibrium – present day samples
[Figure: ancestral haplotype and present-day haplotypes]
Patterns at present-day haplotypes depend on:
• Recombination rate
• Mutation rate
• Population size
• Natural selection
The closer two markers are, the more likely they are to share the same ancestral haplotype

Nearby SNPs are highly correlated
Recombination induces haplotype “blocks” of correlated SNPs

Linkage disequilibrium decays with distance
http://graphics.cs.wisc.edu/WP/vis10/archives/458-hapmap-linkagedisequilibrium-plot

Factors affecting LD
• LD decays faster in populations with higher heterozygosity
• Recombination occurs in “hotspots” along each chromosome
A global reference for human genetic variation. Nature 2015
Mammalian recombination hot spots: properties, control and evolution.
Nature Reviews Genetics 2010

LD ends up saving us $$ - “Tag SNPs”
• LD induces tight correlation between nearby variants
• Rather than genotyping and testing each SNP for association with disease, we can focus on a few “tag” SNPs
Haplotype 1  AACACAAGCTAGCTACCTACGTAGCTACAT  30%
Haplotype 2  AACACAAGCTAGCTACCTACGTAGGTACAT  30%
Haplotype 3  AACACTAGCTAGCTACCTATGTAGGTACAT  20%
Haplotype 4  AACACTAGCAAGCTACCTACGTAGGTACAT  10%
Haplotype 5  AACACTAGCAAGCTACCTACGTAGGTACGT  10%

LD ends up saving us $$ - Imputation
Reference haplotype panel (e.g. 1000 Genomes):
AACACAAGCTAGCTACCTACGTAGCTACAT
AACACAAGCTAGCTACCTACGTAGGTACAT
AACACTAGCTAGCTACCTATGTAGGTACAT
AACACTAGCAAGCTACCTACGTAGGTACAT
AACACTAGCAAGCTACCTACGTAGGTACGT
Your data genotyped on SNP chip:
AACACTAGC?AGCTACCTA?GTAGGTAC?T
AACACAAGC?AGCTACCTA?GTAGGTAC?T
AACACTAGC?AGCTACCTA?GTAGGTAC?T
AACACTAGC?AGCTACCTA?GTAGGTAC?T
AACACAAGC?AGCTACCTA?GTAGCTAC?T

Linkage disequilibrium metrics
Two loci with alleles A/a and B/b. Haplotype frequencies:
         B           b           Total
A        $p_{AB}$    $p_{Ab}$    $p_A$
a        $p_{aB}$    $p_{ab}$    $p_a$
Total    $p_B$       $p_b$       1.0

Linkage equilibrium:
$p_{AB} = p_A p_B$, $p_{Ab} = p_A(1-p_B)$, $p_{aB} = (1-p_A)p_B$, $p_{ab} = (1-p_A)(1-p_B)$
Linkage disequilibrium:
$p_{AB} \neq p_A p_B$, $p_{Ab} \neq p_A(1-p_B)$, $p_{aB} \neq (1-p_A)p_B$, $p_{ab} \neq (1-p_A)(1-p_B)$

D: $D = p_{AB} - p_A p_B$
• Sign is arbitrary
• Range depends on allele frequencies
D’: $D' = D / D_{max}$ (normalize by the theoretical maximum magnitude of D)
• $D_{max} = \min(p_A p_B, (1-p_A)(1-p_B))$ if D<0
• $D_{max} = \min(p_A(1-p_B), (1-p_A)p_B)$ if D>0
• Range between -1 and +1
r: $r = D / \sqrt{p_A(1-p_A)\,p_B(1-p_B)}$
• Same as correlation coefficient. Between -1 and +1
• $r^2$ gives power loss in association studies
• Most commonly used in population genetics

Problem set 2
Prob. 1: Principal components ancestry analysis
Lecture #3 (01/17/17): Determining Ancestry
Prob. 2: Relative finding
Given a set of genomes, estimate all pairwise relationships
Lecture #3 (01/17/17): Determining Ancestry
Prob. 3: Imputation
[Figure: the reference haplotype panel and SNP chip data from the “LD ends up saving us $$ - Imputation” slide]
Lecture #4 (01/19/17): Phasing and Imputation

Journal club: Positive selection

Signals of selection
Signs of positive selection:
• High derived (non-ancestral) allele frequency
• High differentiation between populations (reflecting recent local adaptations)
• Long haplotypes: new alleles have risen to high frequency without enough time for recombination to break down haplotypes
Tests: LRH, iHS, XP-EHH
Signs of purifying selection:
• Less variation than expected by chance. Deleterious allele is rare.
Tests: Tajima’s D, Fu and Li Test, many others

Websites to explore specific variants
• dbSNP (https://www.ncbi.nlm.nih.gov/snp). Most SNP identifiers (rsids) are from dbSNP. Website documents source of data, MAF, alleles.
• ExAC (http://exac.broadinstitute.org/). Genetic variation from 60,000+ samples. Provides allele frequencies by population
• SNPedia (https://snpedia.com/).
  Detailed curated information about specific SNPs

Examples
• LCT - rs4988235 (lactose tolerance) https://www.ncbi.nlm.nih.gov/snp/?term=rs4988235
• SLC24A5 - rs1426654 (light skinned, European) http://exac.broadinstitute.org/variant/15-48426484-A-G
• EDAR, EDA2R - rs3827760 (hair development, Asians) http://exac.broadinstitute.org/variant/2-109513601-A-G
• HBB – rs334 (sickle cell) http://exac.broadinstitute.org/variant/11-5248232-T-A

Bonus slides

Mutation selection balance
Mutation-selection balance: the rate at which deleterious alleles arise by mutation is equal to the rate at which they are removed by selection
• AA: fitness ($w_{11}$) = 1
• AB: fitness ($w_{12}$) = 1-hs
• BB: fitness ($w_{22}$) = 1-s
• Mean fitness: $\bar{w} = p^2 w_{11} + 2pq\,w_{12} + q^2 w_{22}$
• h: dominance effect (between 0 and 1)
• s: selection coefficient
• A mutates to B at rate μ
• P(A) = p, P(B) = q
Frequency of allele A in the next generation is controlled by both mutation and selection:
$$p' = \frac{p^2 w_{11} + pq\,w_{12}}{\bar{w}}(1-\mu)$$
At equilibrium, p’ = p. Solving gives:
• $q \approx \mu/(hs)$ (partial dominance, h>0)
• $q \approx \sqrt{\mu/s}$ (completely recessive, h=0)
Thus: high s (very deleterious) pushes the deleterious allele (B) to low frequency.

Modeling genetic diversity – Coalescent Theory
[Figure: genealogy running from the ancestral population, through the most recent common ancestor (MRCA), down to present-day haplotypes]
• Descendants for each generation are chosen at random with replacement from the current generation
• Key idea of coalescent theory: analyze backwards, rather than forward, in time. Focus on ancestors of the current population (e.g. only yellow).
• Extremely useful framework for simulating haplotype histories
http://www.csbio.unc.edu/mcmillan/index.py?run=Courses.Comp790S09

Modeling genetic diversity – Coalescent Theory
Population size: 2N gene copies (N = number of diploid individuals)
• P(allele in gen. t has parent in t-1) = 1
• P(another allele in gen. t has same parent) = $\frac{1}{2N}$
• Coalescence after 1 generation: $\frac{1}{2N}$
• Coalescence after 2 generations: $\frac{1}{2N}\left(1-\frac{1}{2N}\right)$
• Coalescence after t generations: $\frac{1}{2N}\left(1-\frac{1}{2N}\right)^{t-1}$
Coalescent time of two gene copies follows a geometric distribution with mean 2N
For k gene copies: $4N(1-1/k)$

Coalescence with mutation
Mutations happen along the branches of the tree at some rate μ per site per generation.
For a pair of sequences (Sample 1 and Sample 2) whose MRCA lived t generations ago:
# mutations between sample 1 and sample 2 = 2tLμ
where L is the length of the gene sequence
Applications of coalescent with mutation:
• Infer population genetics parameters, e.g. mutation rate, from sequence
• Infer effective population size
• Test for selection by comparing observed vs. expected variation

More resources on coalescent theory
Textbooks
• Principles of Population Genetics. Daniel L. Hartl & Andrew G. Clark
• Evolutionary Theory. Sean H. Rice
Simulation tools
• fastsimcoal
• simcoal2
• ms
• msms
• Many others…
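
Before reaching for the dedicated simulators listed above, the pairwise case from the coalescent slides can be checked with a few lines of code. This is a minimal sketch under stated assumptions, not one of those tools: the function names and parameters below are hypothetical, and it models only two gene copies in a constant-size population of N diploid individuals. Going back one generation at a time, the two copies pick the same parent with probability 1/2N, so the waiting time should be geometric with mean 2N, and the expected number of differences follows the 2tLμ formula.

```python
import random


def pairwise_coalescence_time(n_individuals, rng):
    """Generations back until two gene copies share a parent (hypothetical helper)."""
    p_coalesce = 1.0 / (2 * n_individuals)  # chance of picking the same parent
    t = 1
    while rng.random() >= p_coalesce:  # no coalescence in this generation
        t += 1
    return t


def mean_coalescence_time(n_individuals, replicates=20_000, seed=0):
    """Average coalescence time over many independent pairs."""
    rng = random.Random(seed)
    total = sum(pairwise_coalescence_time(n_individuals, rng)
                for _ in range(replicates))
    return total / replicates


def expected_pairwise_differences(t, seq_length, mu):
    """Expected mutations separating two samples whose MRCA lived t generations ago.

    Two branches of length t, seq_length sites, rate mu per site per generation.
    """
    return 2 * t * seq_length * mu


n = 50
print("simulated mean coalescence time:", mean_coalescence_time(n))
print("theoretical mean (2N):", 2 * n)
print("expected differences for t=2N, L=1000, mu=1.5e-8:",
      expected_pairwise_differences(2 * n, 1000, 1.5e-8))
```

The simulated mean should land close to 2N; extending the sketch to k gene copies or to changing population sizes is where the dedicated tools become worthwhile.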