Single Nucleotide Polymorphism, Linkage Disequilibrium, and Haplotypes
Xiaole Shirley Liu
STAT115

Outline
• Definition and motivation
• SNP distribution and characteristics
  – Allele frequency, LD, population stratification
• SNP and genotyping
• Haplotype inference:
  – Clark’s algorithm
  – EM and Gibbs sampling
  – HapMap project and 1000 Genomes

Polymorphism
• Polymorphism: sites/genes with “common” variation, i.e. minor allele frequency ≥ 1%; otherwise the variant is called rare and is not considered polymorphic
• Single Nucleotide Polymorphism (SNP)
  – Arises from a DNA replication error in an individual’s germ-line cell and is then transmitted
  – ~90% of human genetic variation
• Copy number variations
  – May or may not be genetic

Why Should We Care
• Disease gene discovery
  – Association studies, e.g. certain SNPs are associated with susceptibility to diabetes
  – Chromosome aberrations (duplications / deletions) might cause cancer
• Personalized medicine
  – A drug may only be effective if you carry a particular allele

SNP Distribution
• Most common type of variation, about 1 SNP per 100-300 bp
  – Balance between the rate at which mutations are introduced and the rate at which polymorphisms are lost
  – Most mutations are lost within a few generations
• About 2/3 are C/T differences
• In non-coding regions, there are often fewer SNPs in more conserved regions
• In coding regions, there are often more synonymous than non-synonymous SNPs

SNP Characteristics: Allele Frequency Distribution
• Most alleles are rare (minor allele frequency < 10%)

SNP Characteristics: Linkage Disequilibrium
• Hardy-Weinberg equilibrium
  – In a population with genotypes AA, aa, and Aa, if $p$ = freq(A) and $q$ = freq(a), the frequencies of AA, aa, and Aa at equilibrium are $p^2$, $q^2$, and $2pq$ respectively
  – Similarly with two loci, each with two alleles (A/a, B/b): at equilibrium, freq(AB) = freq(A) × freq(B)

SNP Characteristics: Linkage Disequilibrium
• [Figure: equilibrium vs. disequilibrium]
• LD: if alleles at two loci occur together more often than can be accounted for by chance, this indicates that the two loci are physically close on the DNA
  – In mammals, LD is often lost at ~100 kb
  – In fly, LD often decays within a few hundred bases

SNP Characteristics: Linkage Disequilibrium
• Statistical significance of LD
  – Chi-square test (or Fisher’s exact test)
  – $\chi^2 = \sum_{i,j} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$, where $e_{ij} = n_{i\cdot}\, n_{\cdot j} / n_T$

           B1                B2                Total
  A1       $n_{11}$          $n_{12}$          $n_{1\cdot}$
  A2       $n_{21}$          $n_{22}$          $n_{2\cdot}$
  Total    $n_{\cdot 1}$     $n_{\cdot 2}$     $n_T$

SNP Characteristics: Linkage Disequilibrium
• Haplotype block: a cluster of linked SNPs
• Haplotype boundaries: the sequence divides into blocks with strong LD within blocks and no LD between blocks; the boundaries reflect recombination hotspots
• Haplotype size distribution

SNP Characteristics: Linkage Disequilibrium
• [C/T] [A/G] T X C [A/C] [T/A]
  – Possible haplotypes: $2^4 = 16$
  – In reality, a few common haplotypes explain 90% of the variation
• Tagging SNPs
  – SNPs that capture most of the variation in haplotypes
  – Remove the redundancy among linked SNPs

SNP Genotyping
• One SNP at a time or genome-wide (SNP array)

40 Probes Used Per SNP
• Allele call
  – AA, BB, AB
• Signal
  – Theoretically 1A+1B, 2A, 2B
  – But could have 1A+3B: amplified!

Haplotype
• Haplotype: cluster of SNPs in LD
  – A block with 10 SNPs has $2^{10} = 1024$ possible haplotypes
  – In practice only 5-6 haplotypes are observed (covering > 90% of cases)
  – Tagging SNPs: a subset of SNPs sufficient to identify a haplotype
• Disease association studies using haplotypes are more accurate than those using a single SNP locus
• Haplotype inference: Aa BB Cc

Haplotype Inference
• Genotyping only tells us that an individual is, e.g., Aa BB Cc; it does not tell us whether the haplotypes are ABC + aBc or ABc + aBC
• Haplotypes can often be inferred if the parental genotypes are known
  – Similar to blood typing, e.g. blood types F: A, M: AB, C: B imply genotypes F: AO, M: AB, C: BO
• Otherwise, look at the population genotypes and infer common haplotypes

Haplotype Inference: Clark’s Algorithm
1. Construct haplotypes from unambiguous individuals
2. Remove samples that can be explained as combinations of haplotypes already discovered
3. Propose the haplotype that would explain the most remaining samples
4. Iterate 2 and 3 until finished
• Disadvantages:
  – Depends on the number of ambiguous subjects
  – Cannot get started when n is small

EM and Gibbs Sampling in Motif Finding
• Problem
  – Observe: sequence S
  – Unknown: motif θ and site location A (alignment), but given one, we can infer the other
• EM and Gibbs sampler
  – Initialize a random motif θ
  – Iterate:
    • Given θ and sequence S, update site location A
    • Given A and S, update θ
  – EM updates by weighted average
  – Gibbs sampling updates by sampling

Statistical Model for Haplotype

  Haplotype    Frequency
  T T A C C    $\theta_1$
  T T A C G    $\theta_2$
  T T A G C    $\theta_3$
  T T A G G    $\theta_4$
  T T C C C    $\theta_5$
  T T C C G    $\theta_6$
  T T C G C    $\theta_7$
  T T C G G    $\theta_8$

[Figure: haplotype pool from which pairs of haplotypes are drawn]
• Each individual’s two haplotypes are treated as random draws from a pool of haplotypes with certain frequencies, subject to the pair being consistent with the observed genotype

Haplotype Inference: EM and Gibbs Sampler
• Observe genotype Y; estimate the haplotype pair Z for each individual and the haplotype frequencies $\Theta$
• Initialize the haplotype frequencies $\Theta$
• Iteration:
  – Estimate Z given Y, $\Theta$
  – Estimate $\Theta$ given Y, Z

Haplotype Inference: Partition-Ligation
• When the number of SNPs is large, the number of possible haplotypes is too large to enumerate, so divide and conquer
  – Treat an inferred sub-haplotype as one allele

HapMap of the Human Genome
• HapMap: catalog of common genetic variants in humans
  – What are these variants?
  – Where do they occur in our DNA?
  – How are they distributed within populations and between populations around the world?
• Goals:
  – Define haplotype “blocks” across the genome
  – Identify a reference set of SNPs that “tag” each haplotype
  – Enable unbiased, genome-wide association studies

1000 Genomes Project
• Characterization of human genome sequence variation
• Foundation for investigating the relationship between genotype and phenotype

Summary
• SNP and CNV
• SNP distribution and characteristics
  – Allele frequency (minor allele frequency ≥ 1%)
  – LD: linkage ~ physical proximity
  – Population stratification
• SNP genotyping: SNP arrays, sequencing
• Haplotype inference
  – Clark’s: resolve unambiguous individuals first, then propose new haplotypes to maximize explanation
  – EM & Gibbs: iteratively infer haplotype frequencies and individuals’ haplotypes

Acknowledgement
• Stefano Monti
• Jun Liu & Tim Niu
• Kenneth Kidd, Judith Kidd and Glenys Thomson
• Joel Hirschhorn
• Greg Gibson & Spencer Muse
• Cheng Li & Yuhyun Park
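
Worked example: chi-square test for LD
The slide on the statistical significance of LD gives the chi-square statistic over a 2x2 table of haplotype counts. The sketch below is one minimal way to compute it in plain Python. The function name `ld_chi_square` and the example counts are illustrative assumptions, not part of the lecture; the p-value uses the fact that a chi-square variable with 1 degree of freedom has tail probability erfc(sqrt(x/2)).

```python
import math

def ld_chi_square(n11, n12, n21, n22):
    """Chi-square test of independence on a 2x2 haplotype count table.

    Rows are alleles A1/A2 at one locus, columns are alleles B1/B2 at the
    other. Returns (chi2, p_value) with 1 degree of freedom.
    """
    obs = [[n11, n12], [n21, n22]]
    row = [n11 + n12, n21 + n22]          # n_1. and n_2.
    col = [n11 + n21, n12 + n22]          # n_.1 and n_.2
    total = row[0] + row[1]               # n_T
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total   # e_ij = n_i. * n_.j / n_T
            chi2 += (obs[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: A1 mostly travels with B1, A2 with B2 (strong LD).
chi2_ld, p_ld = ld_chi_square(40, 10, 10, 40)   # chi2_ld is 36 for this table
# Hypothetical counts at perfect linkage equilibrium.
chi2_eq, p_eq = ld_chi_square(25, 25, 25, 25)   # chi2_eq is 0, p_eq is 1
```

A large statistic with a small p-value suggests the two loci are not independent; for small counts, Fisher’s exact test (also named on the slide) is the usual alternative.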
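
Worked example: Clark’s algorithm
The four steps on the Clark’s algorithm slide can be sketched as follows. This is a simplified illustration under stated assumptions, not a reference implementation: genotypes are coded as tuples of minor-allele counts per site (0, 1 = heterozygous, 2), haplotypes as tuples of 0/1, and the names `complement` and `clark` are hypothetical. Ties in step 3 are broken arbitrarily but deterministically.

```python
from collections import Counter

def complement(genotype, hap):
    """Haplotype that pairs with `hap` to give `genotype`, or None if impossible."""
    other = tuple(g - h for g, h in zip(genotype, hap))
    return other if all(x in (0, 1) for x in other) else None

def clark(genotypes):
    known, resolved = set(), {}
    # Step 1: individuals with at most one heterozygous site are unambiguous.
    for idx, g in enumerate(genotypes):
        if sum(1 for x in g if x == 1) <= 1:
            h1 = tuple(1 if x == 2 else 0 for x in g)
            h2 = complement(g, h1)
            known.update((h1, h2))
            resolved[idx] = (h1, h2)
    while True:
        # Step 2: remove samples explained by two already-known haplotypes.
        for idx, g in enumerate(genotypes):
            if idx in resolved:
                continue
            for h in sorted(known):
                c = complement(g, h)
                if c in known:
                    resolved[idx] = (h, c)
                    break
        remaining = [i for i in range(len(genotypes)) if i not in resolved]
        # Step 3: propose the new haplotype that explains the most remaining samples.
        votes = Counter()
        for i in remaining:
            for c in {complement(genotypes[i], h) for h in known}:
                if c is not None:
                    votes[c] += 1
        if not votes:
            break          # nothing left, or nothing more can be explained
        known.add(max(sorted(votes), key=votes.get))
        # Step 4: loop back to steps 2 and 3.
    return resolved, known

# Hypothetical three-site genotypes; the last two are ambiguous.
genotypes = [(0, 2, 0), (2, 0, 1), (1, 1, 0), (1, 1, 2)]
resolved, known = clark(genotypes)
```

Here the third individual is explained by two haplotypes found in step 1, while the fourth requires proposing the new haplotype (0, 1, 1). If no individual is unambiguous, `known` starts empty and nothing is resolved, which illustrates the "cannot get started" disadvantage on the slide.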
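
Worked example: EM for haplotype frequencies
The iteration on the EM and Gibbs sampler slide (estimate Z given Y and the frequencies, then re-estimate the frequencies given Y and Z) can be sketched with a small EM loop. This is a minimal sketch that enumerates every compatible haplotype pair, so it only suits a handful of SNPs; it does not implement partition-ligation or the Gibbs sampler. The genotype coding (minor-allele counts 0/1/2), the uniform initialization, and the names `compatible_pairs` and `em_haplotypes` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with a genotype coded 0/1/2 per site."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [1 if g == 2 else 0 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, 1 - b       # phase each heterozygous site
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

def em_haplotypes(genotypes, n_iter=50):
    pairs = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for ps in pairs for p in ps for h in p})
    theta = {h: 1.0 / len(haps) for h in haps}   # uniform starting frequencies
    for _ in range(n_iter):
        counts = defaultdict(float)
        for ps in pairs:
            # E-step: posterior weight of each pair Z given Y and theta.
            w = [theta[a] * theta[b] * (1 if a == b else 2) for a, b in ps]
            total = sum(w)
            for (a, b), wi in zip(ps, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # M-step: frequencies from expected haplotype counts (2 per individual).
        theta = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return theta

# Hypothetical data: two 00/00 homozygotes, two 11/11 homozygotes, one double heterozygote.
genotypes = [(0, 0), (0, 0), (2, 2), (2, 2), (1, 1)]
theta = em_haplotypes(genotypes)
```

In this toy data set the double heterozygote could be (0,0)+(1,1) or (0,1)+(1,0); because the homozygotes make (0,0) and (1,1) common in the pool, EM shifts essentially all weight to the first phasing. Replacing the weighted average in the E-step with a random draw of one pair per individual would give the Gibbs-style update described on the motif-finding slide.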