Slide 1
Linkage Disequilibrium
Joe Mychaleckyj
Center for Public Health Genomics
982-1107
[email protected]

Slide 2
Today we’ll cover…
• Haplotypes
• Linkage Disequilibrium
• Visualizing LD
• HapMap

Slide 3
References
• Daniel L. Hartl and Andrew G. Clark. Principles of Population Genetics, Fourth Edition.
• Bruce S. Weir. Genetic Data Analysis II.

Slide 4
References
• Benjamin M. Neale, Manuel A.R. Ferreira, Sarah E. Medland, Danielle Posthuma (Eds). Statistical Genetics: Gene Mapping Through Linkage and Association.

Slide 5
SNP1 [A / T]    SNP2 [C / G]    SNP3 [A / G]
     A               C               G
     A               C               A
     T               G               G
Haplotype: specific combination of alleles occurring (cis) on the same chromosome (segment of chromosome)
N SNPs - How many haplotypes are possible? 2^N (ie very large diversity possible)

Slide 6
Terminology
• Haplotype: Specific combination (phasing) of alleles occurring (cis) on the same chromosomal segment
• Linkage/Linked Markers: Physical colocation of markers on the same chromosome
• Diplotype: Haplogenotype, ie pair of phased haplotypes, one maternally and one paternally inherited

Slide 7
SNP1 [ A / a ]    SNP2 [ B / b ]
Major Allele Freq: p(A), p(B)
Minor Allele Freq: p(a), p(b)

LINKAGE EQUILIBRIUM
Independently segregating SNPs: Haplotype Frequency p(ab) = p(a) x p(b)
(How many haplotypes in total?)

LINKAGE DISEQUILIBRIUM
Haplotype Frequency p(ab) ≠ p(a) x p(b)

Slide 8
Linkage Disequilibrium
• Non-random assortment of alleles at 2 (or more) loci
• The closer the markers, the stronger the LD, since recombination will have occurred at a low rate
• Markers co-segregate within and between families

Slide 9
* LINKAGE EQUILIBRIUM *
Not a Punnett Square!
                    SNP2 Allele
SNP1 Allele     B             b             Total
A               p(A)p(B)      p(A)p(b)      p(A)
a               p(a)p(B)      p(a)p(b)      p(a)
Total           p(B)          p(b)
Example (column B total): p(A)p(B) + p(a)p(B) = p(B){ p(A) + p(a) } = p(B)

Slide 10
SNP1 [ A / a ]    SNP2 [ B / b ]
Major Allele Freq: p(A), p(B)
Minor Allele Freq: p(a), p(b)

LINKAGE DISEQUILIBRIUM
Haplotype Frequency p(ab) = p(a) p(b) + D
(sign of D is generally arbitrary, unless comparing D values between populations or studies)
D: Lewontin’s LD Parameter (Lewontin 1960)

Slide 11
* LINKAGE DISEQUILIBRIUM *
                    SNP2 Allele
SNP1 Allele     B               b               Total
A               p(A)p(B) + D    p(A)p(b) - D    p(A)
a               p(a)p(B) - D    p(a)p(b) + D    p(a)
Total           p(B)            p(b)
Column B total: p(A)p(B) + D + p(a)p(B) - D = p(B){ p(A) + p(a) } = p(B)

Slide 12
        b       B
a       0.16    0.04    p(a)=0.20
A       0.14    0.66    p(A)=0.80
        p(b)=0.30   p(B)=0.70
What is the LD? D ≠ 0, since p(ab) ≠ p(a) p(b)
p(ab) = p(a) p(b) + D
0.16 = 0.2 x 0.3 + D
D = 0.1
Since p(ab) = p(a)p(b) + D was used, D is +ve here, but the sign is arbitrary, eg can relabel alleles A, B as minor

Slide 13
Range of D values (-ve to +ve)
D has a minimum and maximum value that depends on the allele frequencies of the markers
Since haplotype frequencies cannot be -ve:
p(aB) = p(a)p(B) - D ≥ 0, so D ≤ p(a)p(B)
p(Ab) = p(A)p(b) - D ≥ 0, so D ≤ p(A)p(b)
Both must hold, so D ≤ min( p(a)p(B), p(A)p(b) )
p(ab) = p(a)p(b) + D ≥ 0, so D ≥ -p(a)p(b)
p(AB) = p(A)p(B) + D ≥ 0, so D ≥ -p(A)p(B)
Both must hold, so D ≥ max( -p(a)p(b), -p(A)p(B) )
* Similar equations if we had defined p(ab) = p(a)p(b) - D

Slide 14
Limits of D LD Parameter
Limits of D are a function of allele frequencies
Standardize D by rescaling to a proportion of its maximal value for the given allele frequencies (D’)
D’ = D / Dmax

Slide 15
D’ (Lewontin, 1964)
D’ = D / Dmax
Dmax = min( p(A)p(B), p(a)p(b) )    if D < 0
Dmax = min( p(A)p(b), p(a)p(B) )    if D > 0
Again, sign of D’ depends on definition
D’ = 1 or -1 if one of the haplotype frequencies p(AB), p(Ab), p(aB), p(ab) = 0
= Complete LD (ie at most 3 haplotypes seen)
D’ = 1 or -1 suggests that no recombination has taken place between markers
Beware rare markers - may not have enough power/sample size to detect 4th haplotype

Slide 16
D’ Interpretation
        b       B
a       0.06    0.14    p(a)=0.20
A       0.24    0.56    p(A)=0.80
        p(b)=0.30   p(B)=0.70
D = 0 ; Dmax undefined

        b       B
a       0.2     0       p(a)=0.20
A       0.1     0.7     p(A)=0.80
        p(b)=0.30   p(B)=0.70
D = Dmax = 0.14 ; D’ = +1

p(a) = 0.2, p(b) = 0.3
D’ = 1 (perfect LD using the D’ measure)
- No recombination between markers
- Only 3 haplotypes are seen

Slide 17
Creation of LD
• Easiest to understand when markers are physically linked
• Creation of LD
– Mutation
– Founder effect
– Admixture
– Inbreeding / non-random mating
– Selection
– Population bottleneck or stratification
– Epistatic interaction
• LD can occur between unlinked markers
• Gametic phase disequilibrium is a more general term

Slide 18
[Figure: haplotypes at SNP1 and SNP2, with panels labelled n=2 haplotypes, n=3 haplotypes and, after recombination, n=4 haplotypes (A B, A b, a B, a b)]

Slide 19
Destruction of LD
• Main force is recombination
• Gene conversion may also act at short distances (~ 100-1,000 bases)
• LD decays over time (generations of interbreeding)

Slide 20
SNP1 SNP2
Probability recombination occurs = θ
Probability recombination does not occur = 1 - θ
Initial LD between SNP1 and SNP2: D0
After 1 generation, preservation of LD: D1 = D0 (1 - θ)
After t generations: Dt = D0 (1 - θ)^t
NB: Overly simple model, does not account for allele frequency drift over time

Slide 21
Dt = D0 (1 - θ)^t
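The D’ definition (slides 14 to 16) and the decay formula (slides 20 and 21) can be sketched in Python as follows. This is a minimal illustration and not part of the original slides: the function names and the tolerance EPS are assumptions, and D’ is reported as 0 when D = 0, where the slides describe Dmax as undefined.

```python
import math

EPS = 1e-12  # assumed tolerance for treating D as zero (not from the slides)


def lewontin_d(p_ab, p_a, p_b):
    """D under the slides' convention p(ab) = p(a)p(b) + D."""
    return p_ab - p_a * p_b


def d_prime(p_ab, p_a, p_b):
    """D' = D / Dmax, with Dmax chosen by the sign of D (slide 15)."""
    p_A, p_B = 1.0 - p_a, 1.0 - p_b
    d = lewontin_d(p_ab, p_a, p_b)
    if abs(d) < EPS:
        # Assumption of this sketch: return 0 where the slides call Dmax undefined.
        return 0.0
    if d > 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    return d / d_max


def ld_after(d0, theta, t):
    """Expected D after t generations: Dt = D0 * (1 - theta)**t."""
    return d0 * (1.0 - theta) ** t


def generations_to_fraction(theta, fraction):
    """Generations for D to fall to `fraction` of D0 (requires 0 < theta < 1)."""
    return math.log(fraction) / math.log(1.0 - theta)


if __name__ == "__main__":
    print(d_prime(0.06, 0.2, 0.3))   # left-hand table on slide 16
    print(d_prime(0.20, 0.2, 0.3))   # right-hand table on slide 16
    print(ld_after(0.14, 0.01, 100))
    print(generations_to_fraction(0.01, 0.5))
```

Under this simple model, with θ = 0.01 the value of D halves in roughly 69 generations, which gives a feel for how slowly LD decays between tightly linked markers.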
Slide 22
r² LD Parameter (Hill & Robertson, 1968)
r² = D² / ( p(a) p(b) p(A) p(B) )
• Squared correlation coefficient, varies 0 - 1
• Frequency dependent
• Better LD measure for allele correlation between markers - predictive power of SNP1 alleles for those at SNP2
• Used extensively in disease gene or phenotype mapping through association testing

Slide 23
r² Interpretation
        b       B
a       0.06    0.14    p(a)=0.20
A       0.24    0.56    p(A)=0.80
        p(b)=0.30   p(B)=0.70
D = 0 ; Dmax undefined
r² = 0

        b       B
a       0.2     0       p(a)=0.20
A       0.1     0.7     p(A)=0.80
        p(b)=0.30   p(B)=0.70
D = Dmax = 0.14 ; D’ = +1
r² = 0.14/0.24 = 0.58

p(a) = 0.2, p(b) = 0.3
r² ≠ 1: correlation is not perfect, even though D’ = 1
r² = 1 if D’ = 1 and p(a) = p(b) (for example, both 0.3)

Slide 24
r² Interpretation
p(a) = 0.3, p(b) = 0.3
Only 2 haplotypes: r² = 1, correlation is perfect
D’ = 1 (fewer than 4 haplotypes)
p(a) = p(b) (= 0.3 in this example)
• r² = 1 when there is perfect correlation between markers and one genotype predicts the other exactly
– Only 2 haplotypes present
• D’ = 1 does not imply r² = 1
• r² = 1 requires no recombination AND markers with identical allele frequency
– SNPs are of similar age
• Corollary
– Low r² values do not necessarily mean high recombination
– They can also reflect discrepant allele frequencies

Slide 25
Common Measures of Linkage Disequilibrium
D’: -1 to 1 (recombination)
r²: 0 to 1 (correlation)
Other LD measures exist but are less commonly used

Slide 26
Visualizing LD metrics

Slide 27
[Figure: pairwise |D’| plot for SNP1 to SNP6, with a colour scale for |D’| from 0 to 1.0]
Not usually worried about sign of D’

Slide 29
Haploview: TCN2 (r²)

Slide 30
http://www.hapmap.org
Launched October 2002

Slide 31
International HapMap Project
• Initiated Oct 2002
• Collaboration of scientists worldwide
• Goal: describe common patterns of human DNA sequence variation
• Identify LD and haplotype distributions
• Populations of different ancestry (European, African, Asian)
– Identify common haplotypes and population-specific differences
• Has had major impact on:
– Understanding of human population history as reflected in genetic diversity and similarity
– Design and analysis of genetic association studies

Slide 32
HapMap samples
• 90 Yoruba individuals (30 parent-parent-offspring trios) from Ibadan, Nigeria (YRI)
• 90 individuals (30 trios) of European descent from Utah (CEU)
• 45 Han Chinese individuals from Beijing (CHB)
• 44 Japanese individuals from Tokyo (JPT)

Slide 33
Project feasible because of:
• The availability of the human genome sequence
• Databases of common SNPs (subsequently enriched by HapMap) from which genotyping assays could be designed
• Development of inexpensive, accurate technologies for high-throughput SNP genotyping
• Web-based tools for storing and sharing data
• Frameworks to address associated ethical and cultural issues

Slide 34
HapMap goals
• Define patterns of genetic variation across human genome
• Guide selection of SNPs efficiently to “tag” common variants
• Public release of all data (assays, genotypes)
• Phase I: 1.3 M markers in 269 people
– 1 SNP/5 kb (1.3 M markers)
– Minor allele frequency (MAF) > 5%
• Phase II: +2.8 M markers in 270 people

Slide 35
http://www.hapmap.org/

Slide 38
HapMap publications
• The International HapMap Consortium. A Haplotype Map of the Human Genome. Nature 437, 1299-1320. 2005.
• The International HapMap Consortium. The International HapMap Project. Nature 426, 789-796. 2003.
• The International HapMap Consortium. Integrating Ethics and Science in the International HapMap Project. Nature Reviews Genetics 5, 467-475. 2004.
• Thorisson, G.A., Smith, A.V., Krishnan, L., and Stein, L.D. The International HapMap Project Web site. Genome Research 15, 1591-1593. 2005.

Slide 39
ENCODE project
• Aim: To compare the genome-wide resource to a more complete database of common variation, one in which all common SNPs and many rarer ones have been discovered and tested
• Selected a representative collection of ten regions, each 500 kb in length
• Each 500-kb region was sequenced in 48 individuals, and all SNPs in these regions (discovered or in dbSNP) were genotyped in the complete set of 269 DNA samples

Slide 40
Comparison of linkage disequilibrium and recombination for two ENCODE regions
Nature 437, 1299-1320. 2005

Slide 41
LD in Human Populations

Slide 42
Haplotype Blocks
N SNPs = 2^N haplotypes possible, ie very large diversity possible
But: we do not see the full extent of haplotype diversity in human populations
Extensive LD, especially at short distances, eg ~20 kb
Haplotypes are broken into blocks of markers with high mutual LD, separated by recombination hotspots
Non-uniform LD across genome

Slide 43
Haplotype Blocks
Haplotype blocks: at least 80% of observed haplotypes with frequency >= 5% could be grouped into common patterns
Hinds et al. Whole Genome Patterns of Common DNA Variation in Three Human Populations. Science 2005.

Slide 44
Length of LD spans (r²)

Slide 45
Example: Large block of LD on chromosome 17
Cluster of common (frequent) SNPs in high LD
518 SNPs, spanning 800 kb
25% in EUR, 9% in AFR, missing in CHN
Genes:
• Microtubule-associated protein tau
– Mutations associated with a variety of neurodegenerative disorders
• Gene coding for a protease similar to presenilins
– Mutations result in Alzheimer’s disease
• Gene for corticotropin-releasing hormone receptor
– Immune, endocrine, autonomic, behavioral response to stress

Slide 46
Chromosome 17 LD Region
Prevalent inversion in EUR human population, ~25%
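
As a recap of the r² measure from slides 22 to 24, which the LD-span and Haploview slides rely on, here is a minimal Python sketch. It is an illustration rather than part of the original slides: the function name `r_squared` and the check for monomorphic markers are assumptions, and the inputs are the minor-allele haplotype frequency p(ab) and the allele frequencies p(a) and p(b).

```python
def r_squared(p_ab, p_a, p_b):
    """r^2 = D^2 / (p(a) p(b) p(A) p(B)), with D = p(ab) - p(a)p(b)."""
    p_A, p_B = 1.0 - p_a, 1.0 - p_b
    denom = p_a * p_b * p_A * p_B
    if denom == 0.0:
        # Assumption of this sketch: refuse monomorphic markers, where r^2 is undefined.
        raise ValueError("r^2 is undefined when a marker has a single allele")
    d = p_ab - p_a * p_b
    return d * d / denom


if __name__ == "__main__":
    print(r_squared(0.06, 0.2, 0.3))  # slide 23, left-hand table (D = 0)
    print(r_squared(0.20, 0.2, 0.3))  # slide 23, right-hand table (D' = 1)
    print(r_squared(0.30, 0.3, 0.3))  # slide 24, only 2 haplotypes
```

The second call reproduces the 0.14/0.24 ratio from slide 23, and the third corresponds to the two-haplotype case on slide 24 where equal allele frequencies and D’ = 1 give r² = 1.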