Fundamental Concepts in Gene Mapping (BIO227)
Linkage analysis: The basic concepts

Review of Mapping Strategies
• What is the biological basis of linkage analysis?
• What parameter do we test in linkage analysis?
• What is the null hypothesis? If we reject the null, what do we conclude?
• If the recombination parameter θ between 2 loci is 0.10, are these 2 loci linked? What is the distance between them in centimorgans? What is it in base pairs? Would you expect these two loci to be in LD?

Learning Objectives for Today
The basic concepts and principles of linkage analysis:
• Making genetic maps
• Disease mapping
• Direct counting method (for Mendelian disorders)
• Affected sib pairs for nonparametric linkage
• Testing for the presence of linkage and LOD scores

Very Big Overview
• Get phenotype data from families
• Cover the genome with markers, say 440 or 4,000,000
• Test every marker AND every location in between markers
• Convert the test statistic to a LOD score
• If the LOD exceeds 3 (more or less), declare linkage in the region

Role of Linkage Analysis in Gene Mapping
General usage: determine the 'genetic distance' between 2 or more loci; genetic distance is determined by θ.
MAKE GENETIC MAPS
– The first human genetic map was completed in 1987, with 440 markers organized into 23 linkage groups. Locations between markers were determined via linkage.
LOCATE DISEASE GENES
– Locate the distance between a hypothetical disease susceptibility locus (DSL) and one or more known markers on the map.
– Genes for around 2,000 to 3,000 Mendelian disorders were found using linkage; it was not very successful for complex disorders.
– Finding new Mendelian disorders using sequence analysis of families.

Constructing Genetic Maps
• θ = P(recombination between 2 loci)
• The recombination fraction increases with the physical distance between loci. => The recombination fraction can be used to measure relative distances on the chromosome.
• The Morgan is the unit of measure: for small θ, θ ≈ distance in Morgans and θ × 100 ≈ distance in cM.
• How do we estimate P(recombination) from data?
• Look at transmissions from parents to offspring and count recombinations.

Probability of haplotype transmission for two biallelic loci (a/A locus and b/B locus)
Possible haplotypes transmitted to offspring (assume phase known): ab, aB, Ab, AB.
Possible parental genotypes: aa bb, aa bB, aA bb, aA bB, aA Bb.
P(transmission) depends on θ ONLY with double het parents.

Probability of haplotype transmission for two biallelic loci (a/A locus and b/B locus)
Possible haplotypes transmitted to offspring, by parental diplotype:

Parental diplotype   ab        aB        Ab        AB
ab|ab                1         0         0         0
ab|aB                1/2       1/2       0         0
aB|aB                0         1         0         0
ab|Ab                1/2       0         1/2       0
ab|AB                (1-θ)/2   θ/2       θ/2       (1-θ)/2
aB|Ab                θ/2       (1-θ)/2   (1-θ)/2   θ/2

P(offspring is recombinant | double het parent) = θ.

A Simple Example: Linkage between the ABO locus and the AK1 locus

How do we estimate θ (and the genetic distance)?
METHOD 1: Direct Counting Method
• Assume data at two markers on parents and offspring.
• Identify haplotype transmissions from each double heterozygote parent to each of their offspring for the two loci.
• Count recombinant haplotypes in the offspring for the two loci.
• Use the resulting data for estimation and testing.
[Figure: CEPH family pedigree with genotypes at two markers]

Direct counting method using a sample of families
• The unit of analysis is a pair (meiosis) consisting of a double heterozygous parent and an offspring; the number of such pairs gives the sample size.
• The random variable is the transmission from each parent to offspring: Z_i = 1 if recombinant, 0 otherwise (i indexes double het parent-child pairs).
• Let r denote the sum of the Z_i; it counts the number of offspring-parent pairs where the transmitted haplotype is a recombinant.
• s denotes the number of offspring-parent pairs where the transmitted haplotype is a non-recombinant.
• The total number of informative transmissions is n = r + s; it equals the number of double het parent-child pairs.

Direct counting method
Principle: transmissions from different parents are independent, and transmissions to different offspring are independent. Pr(recombinant) = θ is the same for every pair.
• What is the distribution of r?
• How do we specify the null hypothesis of no linkage?
• How do we estimate θ given r and n?
• How do we test for linkage? (any number of ways)

Direct counting method: inference about θ
• r is Binomial(n, θ): $p(r) = \binom{n}{r}\,\theta^{r}(1-\theta)^{n-r}$, $r = 0, \dots, n$
• $\hat{\theta} = \min(r/n,\ 1/2)$
• To test H0: θ = ½, use the likelihood ratio test (LRT).
• The likelihood ratio test compares p(r) as a function of θ to p(r) when θ = ½.
• In general, inference is complicated by the fact that θ is constrained to be ≤ ½.
[Figure: CEPH family pedigree with genotypes at two markers]

Autosomal dominant inheritance: disease status is observed, but DSL alleles are not. A marker locus with alleles M, m is observed.
We assume
• Complete penetrance, no phenocopies
• Dd or DD = affected and dd = unaffected
Step 1: Infer the disease genotype and missing markers
Step 2: Infer phase and the informative meioses
Step 3: Count the number of recombinants and nonrecombinants
[Figure: three-generation pedigree. Grandparents: D? with marker ??, and dd mm. Parents: Dd Mm and dd mm. Offspring (disease genotype, marker genotype): dD mm; dD mM; dD mM; dd mm; dd mm]
r = 1, s = 4, $\hat{\theta}$ = r/n = 0.2
What happens if the grandmother's marker data are missing, or if she is mM?

Problems with Parametric Analysis
We assume
• Complete penetrance
• DD, Dd = affected and dd = unaffected
Both grandparents' genotypes missing? Then we cannot determine phase.
Step 1: Consider both phases
Step 2: Identify informative meioses under each phase
Step 3: Count the number of recombinants and nonrecombinants under each phase
Step 4: Combine over phases
[Figure: the same pedigree with the grandparents' marker genotypes missing. Recombination indicators for the five offspring under phase 1: 1, 0, 0, 0, 0]
Phase 1: r = 1, s = 4, $\hat{\theta}$ = 0.2. Phase 2: r = 4, s = 1, $\hat{\theta}$ = 0.8?
What shall we do? What are P(phase 1) and P(phase 2)?

Handling missing phase in a parent
• r is Binomial(n, θ) if phase is known; under the other phase, s is Binomial(n, θ).
• If we know P(phase), we can compute p(r) as P(r) = P(r | phase 1) P(phase 1) + P(r | phase 2) P(phase 2).
• P(phase) = ½. Why?
• $P(r) = \tfrac{1}{2}\binom{n}{r}\theta^{r}(1-\theta)^{s} + \tfrac{1}{2}\binom{n}{s}\theta^{s}(1-\theta)^{r} = \tfrac{1}{2}\binom{n}{r}\left\{\theta^{r}(1-\theta)^{s} + \theta^{s}(1-\theta)^{r}\right\}$
• This can be used to estimate θ, or in an LR test or LOD score, but simple chi-square tests no longer apply.
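The direct counting calculations above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function names (`theta_hat`, `lod_phase_known`, `lod_phase_unknown`, `max_lod`) and the grid search over θ are assumptions made here for the example. It uses the binomial likelihood for known phase and the equal-weight mixture over the two phases when phase is unknown, each compared with the likelihood at θ = ½ on the log10 scale.

```python
import math


def theta_hat(r, n):
    """Constrained estimate of theta: r/n, capped at 1/2."""
    return min(r / n, 0.5)


def _klog10(k, p):
    """k * log10(p), using the convention 0 * log10(0) = 0."""
    if k == 0:
        return 0.0
    return k * math.log10(p) if p > 0 else -math.inf


def lod_phase_known(r, n, theta):
    """log10 of L(theta) / L(1/2) for r recombinants in n informative meioses.

    The binomial coefficient cancels in the ratio.
    """
    s = n - r
    return _klog10(r, theta) + _klog10(s, 1 - theta) - n * math.log10(0.5)


def lod_phase_unknown(r, n, theta):
    """Same ratio when phase is unknown: average the two phases with weight 1/2."""
    s = n - r
    lik = 0.5 * (theta**r * (1 - theta) ** s + theta**s * (1 - theta) ** r)
    null = 0.5**n  # both phases give the same likelihood at theta = 1/2
    return math.log10(lik / null) if lik > 0 else -math.inf


def max_lod(lod_fn, r, n, steps=500):
    """Grid search over theta in [0, 1/2]; returns (max LOD, theta at the max)."""
    grid = [0.5 * i / steps for i in range(steps + 1)]
    return max((lod_fn(r, n, t), t) for t in grid)


if __name__ == "__main__":
    r, n = 1, 5  # the pedigree example: 1 recombinant, 4 nonrecombinants
    print("theta_hat:", theta_hat(r, n))
    print("max LOD, phase known:  ", max_lod(lod_phase_known, r, n))
    print("max LOD, phase unknown:", max_lod(lod_phase_unknown, r, n))
```

With the same counts, the phase-unknown curve sits below the phase-known curve at θ = 0.2, which is one way to see how much information is lost when the grandparents are not genotyped.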
Complications with parametric analysis
• Recessive model calculations are difficult: the genotype is often not possible to infer.
• Suppose incomplete penetrance? Suppose phenocopies?
  – Unaffecteds could be dd, Dd or DD
  – Affecteds could be dd
• Penetrance functions often depend on age for complex disorders.
• Results can be very misleading if the wrong penetrance function is chosen (rely on segregation analysis).
• The likelihood gets very difficult to enumerate, especially with complex pedigrees; one has to consider all possible genotypes and all possible phases.
• This led to increased emphasis on nonparametric methods.
[Figure: pedigree from the autosomal dominant example, with disease and marker genotypes]

Likelihood inference for θ: LOD
• Definition: the likelihood of the data is proportional to the probability (or density) function.
• A likelihood ratio test of H0 vs HA uses LR = L(under alternative) / L(under null).
• L(under alternative) depends on the unknown θ. So choose the value of θ that maximizes the likelihood under the alternative; this maximizes LR.
• LRT = 2 ln{max LR} = 2 ln{max L(under alternative) / L(under null)}
• When H0 is true, the likelihood ratio test is in general approximately chi-square on 1 df. Because of the constraint on θ, the LRT is not chi-square in this case.
• The LOD score is used for testing.

Inference about Linkage: LOD Score
• Definition: log (base 10) of LR(θ): $\mathrm{LOD}(\theta) = \log_{10} LR(\theta)$
• LR is a measure of support for a value θ relative to the null value (½); note that the LOD is a function of the unknown θ.
• A LOD of 1 says P(data for θ) is 10 times what it is for θ = ½.
• A LOD of 2 says P(data for θ) is 100 times what it is for θ = ½.
• Use a maximized LOD score > 3 to reject H0.
• A LOD score can be negative.
• Several advantages over the LRT: easier to combine over families, easy to compare different markers.

Combining LODs from multiple families
• Have K independent families.
• The LR is a product over families:
  – $LR(\theta) = LR_{\mathrm{fam1}}(\theta)\, LR_{\mathrm{fam2}}(\theta)\, LR_{\mathrm{fam3}}(\theta) \cdots$
• ...so the LOD is a sum over families:
  – $\mathrm{lods}(\theta) = \mathrm{lods}_{\mathrm{fam1}}(\theta) + \mathrm{lods}_{\mathrm{fam2}}(\theta) + \mathrm{lods}_{\mathrm{fam3}}(\theta) + \cdots$
• Can calculate lods for each family separately at each value of θ, then add. This is NOT true for the LRT.
• Example:
  – Family 1 has r = 2 and n = 5
  – Family 2 has r = 1 and n = 6
  – Family 3 has r = 0 and n = 3
  – Family 4 has r = 2 and n = 8
  – r_tot = 5, n_tot = 22
  – $\mathrm{LRT}_{\mathrm{tot}}(5/22) \neq \mathrm{LRT}_{\mathrm{fam1}}(2/5) + \mathrm{LRT}_{\mathrm{fam2}}(1/6) + \mathrm{LRT}_{\mathrm{fam3}}(0/3) + \mathrm{LRT}_{\mathrm{fam4}}(2/8)$
• Finding genes for Mendelian disorders was a sequential process; LOD scores are a convenient way to report results.
[Figure: LOD score curves (y-axis: lods; x-axis: theta, 0.0 to 0.5) for Fam 1, Fam 2, Fam 3, Fam 4 and all families combined]

Relation between max LOD and LRT: how big is a max LOD of 3?
• $\max \mathrm{LOD} = \max_{\theta} \log_{10} LR(\theta) = \log_{10}(\max LR)$
• $\mathrm{LRT} = 2 \ln(\max LR)$
• $\mathrm{LRT} = 2 \ln(10) \times \max \mathrm{LOD} \approx 4.6 \times \max \mathrm{LOD}$
• So at the MLE of θ, max LOD > 3 => LRT > 13.8, a very small p-value.

Where does max LOD > 3 originate?
Many justifications:
• Sequential analysis
• Multiple testing argument: take a grid of linked markers over the entire genome; test everywhere.
• Can use properties of recombination to derive P(max LOD exceeds threshold | no linkage anywhere). This depends on the threshold and the length of chromosome tested.
• Cannot do this with association testing!

Summary: Direct Counting
• General features of a parametric linkage method:
  – The mode of inheritance has to be specified (segregation analysis); it was not so successful for complex disease.
  – It could be seriously wrong if the disease model is wrong. It was really only successful for Mendelian diseases.
  – Estimation of the recombination fraction; max LOD is used for inference.

Linkage: Method 2, Nonparametric Analysis
Nonparametric => we do not need to make assumptions about the disease model. Linkage analysis based on counting recombinations can be very inaccurate if the genetic model is incorrect.
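Returning briefly to the four-family LOD example from the parametric part: the additivity of LOD curves, and the failure of additivity for separately maximized statistics, can be checked numerically. The sketch below is illustrative only; it assumes phase-known binomial data for each family, and the names `FAMILIES`, `lod`, `total_lod`, `max_total_lod` and `lrt_from_max_lod` are made up for this example.

```python
import math

# (r, n) for the four families in the example: recombinants, informative meioses
FAMILIES = [(2, 5), (1, 6), (0, 3), (2, 8)]


def lod(r, n, theta):
    """Phase-known LOD: log10 of theta^r (1-theta)^(n-r) / (1/2)^n."""
    def klog10(k, p):
        if k == 0:
            return 0.0  # convention: 0 * log10(0) = 0
        return k * math.log10(p) if p > 0 else -math.inf

    return klog10(r, theta) + klog10(n - r, 1 - theta) - n * math.log10(0.5)


def total_lod(families, theta):
    """LOD curves add across independent families at a common theta."""
    return sum(lod(r, n, theta) for r, n in families)


def max_total_lod(families, steps=500):
    """Maximize the summed LOD curve over a grid of theta in [0, 1/2]."""
    grid = [0.5 * i / steps for i in range(steps + 1)]
    return max((total_lod(families, t), t) for t in grid)


def lrt_from_max_lod(max_lod):
    """LRT = 2 ln(max LR) = 2 ln(10) * max LOD."""
    return 2 * math.log(10) * max_lod


if __name__ == "__main__":
    combined, theta_at_max = max_total_lod(FAMILIES)
    # Maximizing each family at its own theta and then adding is a different quantity
    separate = sum(lod(r, n, min(r / n, 0.5)) for r, n in FAMILIES)
    print("combined max LOD:", combined, "at theta =", theta_at_max)
    print("sum of per-family maxima:", separate)
    print("LRT equivalent of a max LOD of 3:", lrt_from_max_lod(3))
```

Adding the curves at a common θ and then maximizing reproduces the pooled analysis (r_tot = 5, n_tot = 22), while adding per-family maxima gives a larger number because each family is evaluated at its own estimate of θ.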
A nonparametric test is valid under H0, but its power depends on the model.
• Most approaches rely on pairs of affected relatives and on the concept of marker sharing between relatives: IBD or IBS.
• Intuition: a pair of affected relatives is likely to share a disease allele at the DSL, so at a linked marker, sharing the marker allele is also likely.

Alleles shared identical by descent and identical by state
• Allele sharing is defined between 2 individuals.
• Each individual has two alleles, one from Mom and one from Dad. Thus the pair can share 0, 1 or 2 alleles.
• Alleles identical by state (IBS) are those that are physically identical, e.g., both people have a T at an A/T SNP.
• Alleles identical by descent (IBD) must be IBS and also inherited from a common ancestor.
• Alleles that are IBD are also IBS, but not vice versa. With IBD, shared alleles are exact copies.

Examples of identity by state and identity by descent among 2 sibs

Parents    Sib 1   Sib 2   IBS   IBD
ab × cd    ac      bd      ?     ?
ab × cd    ac      ad      ?     ?
ab × cd    ac      ac      ?     ?

Why we love polymorphic markers

Parents    Sib 1   Sib 2   IBS   IBD
ab × cb    bc      ab      ?     ?
ab × cc    ac      ac      ?     ?
ab × ab    ab      ab      ?     ?

Can always tell IBS, but not always IBD; IBD ≤ IBS.
For now, we assume that IBD status is known (= perfect marker information); we will return to this problem later.

Nonparametric Analysis: Pairs of Affected Relatives (Use siblings)
Basic idea: two affected siblings should share the same genetic material IBD at a DSL. Then, if the marker is close (linked) to the DSL, affected siblings will share an 'excess' of alleles at the marker.
Relatives who do NOT share affection status should share less.
We need to consider what we expect about sharing in the absence of disease, but also what we expect at the DSL.
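The IBS and IBD examples for sib pairs can be worked through by brute-force enumeration. The following is a small sketch under stated assumptions: genotypes are written as two-character strings (one character per allele), each parent transmits either allele with probability ½, and the function names `ibs`, `possible_ibd` and `null_sib_ibd` are invented for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def ibs(g1, g2):
    """Number of alleles two genotypes share identical by state (0, 1 or 2)."""
    return sum((Counter(g1) & Counter(g2)).values())


def possible_ibd(father, mother, sib1, sib2):
    """Set of IBD values consistent with the observed genotypes.

    Positions 0/1 in each parental string stand for the two parental
    homologs, so identical letters at different positions are IBS but
    not IBD.
    """
    values = set()
    for i, j, k, l in product(range(2), repeat=4):
        child1 = sorted(father[i] + mother[j])
        child2 = sorted(father[k] + mother[l])
        if child1 == sorted(sib1) and child2 == sorted(sib2):
            # share paternal allele IBD if i == k, maternal if j == l
            values.add((i == k) + (j == l))
    return values


def null_sib_ibd():
    """IBD distribution for full sibs with no selection on phenotype.

    Enumerates the 16 equally likely transmission patterns.
    """
    counts = Counter(
        (i == k) + (j == l) for i, j, k, l in product(range(2), repeat=4)
    )
    return {share: Fraction(c, 16) for share, c in sorted(counts.items())}


if __name__ == "__main__":
    for father, mother, s1, s2 in [
        ("ab", "cd", "ac", "bd"),
        ("ab", "cd", "ac", "ad"),
        ("ab", "cd", "ac", "ac"),
        ("ab", "cb", "bc", "ab"),
        ("ab", "cc", "ac", "ac"),
        ("ab", "ab", "ab", "ab"),
    ]:
        print(father, mother, s1, s2,
              "IBS =", ibs(s1, s2),
              "possible IBD =", sorted(possible_ibd(father, mother, s1, s2)))
    print("null IBD distribution for sibs:", null_sib_ibd())
```

When all four parental alleles are distinct, the set of possible IBD values contains a single number equal to IBS. With less polymorphic markers the set can contain more than one value, or a value smaller than IBS, which is the point of the second group of examples.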
Distribution of IBD relationships under H0
Under the null hypothesis of no linkage between the marker locus and the disease gene (θ = 1/2):
p_k = probability that two affected relatives share k alleles IBD at the marker

Type of relative pair     p0     p1    p2
First cousins             3/4    1/4   0
Double first cousins      9/16   3/8   1/16
Monozygotic twins         0      0     1
Full sibs                 1/4    1/2   1/4
Parent-offspring          0      1     0
Grandparent–grandchild    1/2    1/2   0

Sharing IBD at the DSL: Recessive model (2 copies of the D allele => affected)
[Figure: three nuclear families (Parent 1, Parent 2, sib 1, sib 2) at the disease locus, showing affected and unaffected sibs with IBD = 2, IBD = 1 and IBD = 0 at the DSL]

Affected relative pair analysis
• Collect affected relative pairs (and other members of the pedigree).
• Genotype all relative pairs of each pedigree and determine IBD for each pair.
• Compute the IBD probabilities under the null: p0 (sharing 0 alleles), p1 (sharing 1 allele), p2 (sharing 2 alleles).
• Estimate the IBD probabilities at the marker from the sample.
• Construct a test statistic that compares the IBD probabilities under the null hypothesis with the observed IBD proportions.

Affected sib pair analysis
• Have data on n affected sib pairs: (n0, n1, n2).
• Compare the observed proportions with the IBD probabilities under the null hypothesis: p0 = 1/4, p1 = 1/2, p2 = 1/4.
• Many test statistics (simple ones are not easy to generalize when IBD cannot be determined):
  – MLS methods (maximum likelihood)
  – NPL methods (score tests)
  – LOD scores

MLS methods
Assumptions:
• n affected sib pairs
• Perfect marker information
• p_k = probability of sharing k alleles IBD
Likelihood function: $L(p_0, p_1, p_2) \propto p_0^{n_0}\, p_1^{n_1}\, p_2^{n_2}$
Pro: handles missing IBD
Con: need to test the pattern of sharing

Number of alleles shared IBD   0     1     2     Total
Observed                       n0    n1    n2    n
Expected                       n/4   n/2   n/4   n

Alternative Test for IBD Sharing: Nonparametric Score Test (NPL)
• W_i: number of alleles shared IBD in the ith pair
• μ = E(W_i | H0) = ?
• σ² = Var(W_i | H0) = ?
• $Z = (\bar{W} - \mu) / (\sigma / \sqrt{n})$ is N(0,1) for large n when H0 is true.
• Reject if Z is too big, because under the alternative there is excess marker sharing, so that E(W_i) > E(W_i | H0).

Last Topic: IBD Transition Probabilities
Assume θ between 2 loci is known.
• P(share j alleles at locus 2 | share k alleles at locus 1)
• Can also get the joint distribution of IBD1 and IBD2, and can get P(IBD2) from P(IBD1) and θ.
• These transition probabilities hold no matter what the allele sharing probabilities are at locus 1. They could be the null (1/4, 1/2, 1/4), or locus 1 could be a DSL, with probabilities computed according to the mode of inheritance.

Applications of Basic Principle
Principle: if you know IBD sharing at a locus, you can predict IBD sharing at some distance θ from the locus.
1) Power analysis: assume some disease model, calculate P(sharing at DSL), and compute P(sharing at marker | DSL sharing) for different values of θ. This enables one to compute power for a given disease model, θ and n.
2) Incorporating pairs with incomplete information about IBD at a marker: use data at adjoining markers. This improves power with missing parents or with markers that are not polymorphic.
3) Whole genome linkage scans (multi-marker): H0: DSL is not linked to any marker on the genome; HA: evidence for linkage at at least one locus.

Summary
• What are the main weaknesses of parametric linkage analysis?
• Is missing phase a weakness for nonparametric analysis?
• What is a major limitation of nonparametric analysis?
• What about non-Mendelian disorders?
• With pedigrees, families can be analyzed separately.
• Concepts of IBD can be extended to handle rare variants in families.
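As a closing illustration of the score test and the transition-probability principle, here is a hedged Python sketch for full sib pairs. The slides do not give the transition matrix explicitly; the form used below, written in terms of ψ = θ² + (1−θ)², is the standard sib-pair result and is an assumption brought in for this example, as are the function names `sib_ibd_transition`, `marker_sharing` and `npl_z`. The null moments μ = 1 and σ² = ½ follow from the null sharing probabilities (1/4, 1/2, 1/4).

```python
import math

NULL_SIB = (0.25, 0.5, 0.25)  # P(IBD = 0, 1, 2) for full sibs under H0


def sib_ibd_transition(theta):
    """T[k][j] = P(share j IBD at locus 2 | share k IBD at locus 1), full sibs.

    psi is the probability that a single parent's contribution has the same
    IBD status at both loci (0 or 2 recombinations across the two meioses).
    """
    psi = theta**2 + (1 - theta) ** 2
    return [
        [psi**2, 2 * psi * (1 - psi), (1 - psi) ** 2],
        [psi * (1 - psi), psi**2 + (1 - psi) ** 2, psi * (1 - psi)],
        [(1 - psi) ** 2, 2 * psi * (1 - psi), psi**2],
    ]


def marker_sharing(p_locus1, theta):
    """P(IBD at locus 2) from P(IBD at locus 1) and theta."""
    t = sib_ibd_transition(theta)
    return [sum(p_locus1[k] * t[k][j] for k in range(3)) for j in range(3)]


def npl_z(n0, n1, n2):
    """Score statistic Z = (W_bar - mu) / (sigma / sqrt(n)) for affected sib pairs."""
    n = n0 + n1 + n2
    w_bar = (n1 + 2 * n2) / n
    mu = 1.0                 # 0*(1/4) + 1*(1/2) + 2*(1/4)
    sigma = math.sqrt(0.5)   # E(W^2) - mu^2 = 3/2 - 1
    return (w_bar - mu) / (sigma / math.sqrt(n))


if __name__ == "__main__":
    # Hypothetical sharing probabilities at a DSL, chosen only for illustration
    p_dsl = (0.10, 0.40, 0.50)
    for theta in (0.0, 0.05, 0.2, 0.5):
        print(theta, [round(p, 4) for p in marker_sharing(p_dsl, theta)])
    print("Z for (n0, n1, n2) = (10, 50, 40):", npl_z(10, 50, 40))
```

At θ = 0 the marker inherits the DSL sharing distribution unchanged, and at θ = ½ it falls back to the null (1/4, 1/2, 1/4) whatever the DSL sharing is, which is why power drops as the marker moves away from the DSL.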