Slide 1 - Institute of Statistical Science, Academia Sinica
Transcript
Linkage Analysis I -- Parametric
2006.3.3
I-Ping Tu

Book reference
• http://www.math.chalmers.se/Stat/Grundutb/Chalmers/TMS120/kompendium.pdf
• Genetic Linkage Web Resource: http://linkage.rockefeller.edu/

1. Introduction
• Qualitative trait: e.g. tall/short, green/yellow, affected/unaffected
• Assume a genetic model
  – Parametric linkage analysis
  – Lod score method
  – Large pedigrees
• No genetic model assumption
  – Nonparametric linkage analysis
  – Affected relative pairs

Parametric vs. non-parametric linkage analysis
• Parametric
  – Assume the genetic model is known
• Non-parametric
  – No assumptions about the genetic model
• The parametric analysis is more powerful when the genetic model is correctly specified.
• Problem size limitations
  – Parametric: large pedigrees, small number of markers
  – Non-parametric: small pedigrees, many markers

Phenotype
• Binary
  – Affected or unaffected
  – Left handed or right handed
• Affected, unaffected, and unknown
  – Unknown: possibly part of the syndrome
• Quantitative
  – Insulin resistance
  – Blood pressure

Definitions
• Locus
  – Position on a chromosome
  – Marker locus
  – Disease locus
• Marker
  – A measurable unit on a chromosome
  – Dinucleotide repeat (CA)n
  – Single nucleotide polymorphism (SNP)
• Allele
  – The measurement at a marker locus
  – 2 alleles per locus (one per chromosome)
Example: marker alleles 1 and 4; alleles at the disease locus A and a.

The recombination fraction θ
θ = probability of recombination between two loci.
θ = 0.5 if the distance is "large".
θ < 0.5 if the distance is "short".
An odd number of crossovers = recombination.
An even number = no recombination.

Haldane's mapping function

Recombination fraction: an example
No! Recombination fractions are not additive for large distances.

Penetrance (genetic model)
• Probability of being affected
• Penetrance parameters: f = (f0, f1, f2)
Definition: fk = probability of being affected if you have k disease alleles, k = 0, 1, 2.
Equivalently, fk = P(affected conditional on k disease alleles), k = 0, 1, 2.
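The remark that recombination fractions are not additive can be checked numerically. The sketch below assumes the standard form of Haldane's map, θ = (1 - e^(-2d))/2 with d the map distance in Morgans and no interference; the formula itself did not survive in this transcript, and the function names and the distances 0.30 and 0.40 are illustrative only.

```python
import math

def haldane_theta(d_morgans):
    """Recombination fraction for a map distance d (in Morgans), Haldane's map."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))

def haldane_distance(theta):
    """Inverse map: distance in Morgans for a recombination fraction theta < 0.5."""
    return -0.5 * math.log(1.0 - 2.0 * theta)

def combine(theta_ab, theta_bc):
    """theta between A and C for ordered loci A-B-C without interference:
    a recombinant A-C gamete recombines in exactly one of the two intervals."""
    return theta_ab + theta_bc - 2.0 * theta_ab * theta_bc

# Hypothetical distances: A-B = 0.30 Morgans, B-C = 0.40 Morgans.
theta_ab = haldane_theta(0.30)
theta_bc = haldane_theta(0.40)
theta_ac = haldane_theta(0.70)

print("theta_AB + theta_BC =", round(theta_ab + theta_bc, 4))
print("theta_AC            =", round(theta_ac, 4))
print("combine(AB, BC)     =", round(combine(theta_ab, theta_bc), 4))
```

Map distances add along the chromosome, while the recombination fractions do not: the naive sum overshoots, and θ can never exceed 0.5.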
In symbols: fk = P(affected | k disease alleles), k = 0, 1, 2.
Notation:
A = disease allele
a = normal allele
Disease genotypes: aa, Aa, or AA

Penetrance continued
                    Recessive               Dominant
                    Full p.   Reduced p.    Full p.   Reduced p.
f0 = P(aff | aa)    0         0             0         0
f1 = P(aff | Aa)    0         0             1         0.8
f2 = P(aff | AA)    1         0.7           1         0.8

Dominant with phenocopies and reduced penetrance: f0 = 0.01, f1 = 0.8, f2 = 0.8
Additive penetrances: f0 = 0, f1 = 0.4, f2 = 0.8

Age dependent penetrances

Population prevalence
Kp = proportion of affected individuals in a population = P(aff)
Penetrances: P(aff | aa) = 0.03, P(aff | Aa) = 0.12, P(aff | AA) = 0.50
Definition of conditional probability: P(aff | Aa) = P(aff ∩ Aa) / P(Aa)
Disease allele frequency p = 0.05
Assume that the population is in HWE:
P(aa) = (1-p)^2 = 0.95^2 = 0.9025
P(Aa) = 2p(1-p) = 0.095
P(AA) = p^2 = 0.0025
Kp = P(aff) = ?

Population prevalence contd.
Kp = area of the affected (red) region / total area (aa + Aa + AA)
   = P(aff ∩ aa) + P(aff ∩ Aa) + P(aff ∩ AA)
   = P(aff | aa)P(aa) + P(aff | Aa)P(Aa) + P(aff | AA)P(AA)      (the law of total probability)
   = f0*(1-p)^2 + f1*2p(1-p) + f2*p^2
   = 0.03*0.9025 + 0.12*0.095 + 0.50*0.0025
   = 0.039725 ≈ 0.04

Estimation of the genetic model
• Segregation analysis
  – It is possible to estimate
    • mode of inheritance
    • number of loci contributing to a segregating phenotype
    • penetrance parameters
    • relative frequency (p) of the disease allele in the population
  – Problems?
    • Large population based samples required
    • Ascertainment bias
• In parametric linkage analysis we assume that the genetic model is known.

2. Parametric two-point linkage analysis
• Let θ be the recombination fraction between the disease gene and the observed marker.
  – H0: θ = 0.5 vs. HA: θ < 0.5, where θ is the recombination fraction

Estimation of the recombination fraction θ
Example: N = 4 trios with affected mother and daughter
Assume:
• that all the 12 individuals have been genotyped for a specific DNA marker
• that all the mothers are heterozygous at the marker locus
• that mothers and fathers have disease genotypes (Aa) and (aa), respectively
• that each daughter has inherited a disease allele from her mother
• that parental marker genotypes are not identical
• that the phase is known for all the mothers (unrealistic)
Data:
• Trio 1-3: no recombination between marker and disease locus
• Trio 4: recombination between marker and disease locus
Estimate: θ* = 1/4

Estimation of θ continued
• Assume that all meioses can be scored unequivocally as recombinant or non-recombinant with regard to a marker locus and a disease locus
• n = number of meioses
• r = number of recombinant meioses
Estimate: θ* = r/n
Estimates above 0.5 are not relevant from a biological point of view.
Definition: θ* = min(0.5, r/n)

The binomial distribution
The number of recombinants r among n independent meioses follows a binomial distribution. The probability of r recombinants out of n is a function of the recombination fraction θ. Let us denote this function L(θ):
L(θ) = C(n, r) θ^r (1 - θ)^(n - r)
Note that L(θ) is the probability (likelihood) of the observed data if the recombination fraction is θ. The maximum likelihood estimate (MLE) of θ is the value θ* for which L(θ) reaches its maximum.
MLE: θ* = r/n

Lod score history
• Score proposed by Haldane & Smith 1947
• Newton E. Morton analysed the distribution of the lod score statistic under various assumptions
• Lod scores below -2 are generally accepted as significant evidence against linkage.
  – Common in replication studies.
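The binomial likelihood and the lod score for the four-trio example (r = 1 recombinant among n = 4 meioses) can be sketched as follows. The lod score is taken here as log10 of L(θ)/L(0.5), the usual two-point definition; the function names are illustrative and not part of any linkage package.

```python
import math

def likelihood(theta, r, n):
    """Binomial likelihood of r recombinants among n scored meioses."""
    return math.comb(n, r) * theta**r * (1.0 - theta)**(n - r)

def theta_mle(r, n):
    """Maximum likelihood estimate, truncated at 0.5 as in the definition above."""
    return min(0.5, r / n)

def lod(theta, r, n):
    """Two-point lod score: log10 of L(theta) / L(0.5)."""
    ratio = likelihood(theta, r, n) / likelihood(0.5, r, n)
    # A zero likelihood (e.g. theta = 0 with r > 0) gives a lod of minus infinity.
    return math.log10(ratio) if ratio > 0 else float("-inf")

r, n = 1, 4                      # the four-trio example: one recombinant meiosis
theta_hat = theta_mle(r, n)
print("theta* =", theta_hat)
print("lod(theta*) =", round(lod(theta_hat, r, n), 4))

# The lod curve on a coarse grid; it peaks at theta* = r/n and is 0 at theta = 0.5.
for theta in (0.05, 0.10, 0.25, 0.40, 0.50):
    print(theta, round(lod(theta, r, n), 4))
```

With only four meioses the maximum lod is small, which is why real studies pool many informative meioses before a lod score carries much weight.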
Likelihood ratio test
$H_0: x_1, \dots, x_N \sim f_0$ vs. $H_A: x_1, \dots, x_N \sim f_1$
$L_N = \dfrac{f_1(x_1, \dots, x_N)}{f_0(x_1, \dots, x_N)}$; if $L_N \ge B$, reject $H_0$.

Sequential probability ratio test
$T = \inf\{N : L_N \notin (A, B)\}$
• $L_T \le A$: accept $H_0$
• $L_T \ge B$: reject $H_0$
• $\alpha = P_0(L_T \ge B)$: Type I error
• $\beta = P_1(L_T \le A)$: Type II error (1 - power)

There is a neat approximation relating $\alpha$, $\beta$, $A$, $B$:
$\alpha = E_0\,1(L_T \ge B) = \sum_{n \ge 0} \int f_0(x_1, \dots, x_n)\, 1(T = n, L_n \ge B)\, dx_1 \dots dx_n$
$\quad = \sum_{n \ge 0} \int \dfrac{f_0(x_1, \dots, x_n)}{f_1(x_1, \dots, x_n)}\, f_1(x_1, \dots, x_n)\, 1(T = n, L_n \ge B)\, dx_1 \dots dx_n$
$\quad = \sum_{n \ge 0} E_1\!\left[\dfrac{1}{L_n}\, 1(T = n, L_n \ge B)\right] \le \dfrac{1}{B}\, E_1\,1(L_T \ge B) = \dfrac{1 - \beta}{B}$
$\beta = E_1\,1(L_T \le A) = E_0\!\left[L_T\, 1(L_T \le A)\right] \le A\,(1 - \alpha)$
Approximating the inequalities by equalities:
$\alpha B = 1 - \beta$, $\beta = A(1 - \alpha)$, hence $\alpha = \dfrac{1 - A}{B - A}$, $\beta = \dfrac{A(B - 1)}{B - A}$

More complicated situations
• Phase unknown
• Marker or disease gene homozygosity
• Reduced penetrance
• Varying penetrance
  – age, sex, phenotype, diagnostic uncertainty
• Phenocopies
• Missing marker data
• Extended pedigrees
• Pedigree loops
• Multilocus genotypes

Recessive mode of inheritance
Prerequisites
• Autosomal recessive inheritance
• 100% penetrance: f0 = f1 = 0, f2 = 1
• No phenocopies
• Nuclear family typed for one informative marker
• All four meioses are informative

More complicated situations
• Reduced penetrance
• Varying penetrance
  – age, sex, phenotype, diagnostic uncertainty
• Phenocopies
• Missing marker data
• Extended pedigrees
• Pedigree loops
• Multilocus genotypes

Lod score assignment

The pedigree likelihood contd.
g = (G1, G2, G3, G4) in the recessive example.
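Returning briefly to the sequential probability ratio test earlier in this part: the approximate relations between the error rates and the stopping boundaries can be sketched numerically. The boundaries A = 10^-2 and B = 10^3 are used as an illustration because they correspond to lod thresholds of -2 and 3 on the log10 scale; the threshold 3 and the function names are assumptions of this sketch, and the resulting error rates are only the equality approximations, not exact values.

```python
import math

def wald_boundaries(alpha, beta):
    """Approximate SPRT boundaries for target error rates,
    obtained by rearranging alpha*B = 1 - beta and beta = A*(1 - alpha)."""
    lower = beta / (1.0 - alpha)      # A: accept H0 when L_T <= A
    upper = (1.0 - beta) / alpha      # B: reject H0 when L_T >= B
    return lower, upper

def error_rates(lower, upper):
    """Approximate error rates implied by boundaries A (lower) and B (upper)."""
    alpha = (1.0 - lower) / (upper - lower)
    beta = lower * (upper - 1.0) / (upper - lower)
    return alpha, beta

# Likelihood-ratio boundaries matching lod thresholds of -2 and 3.
alpha, beta = error_rates(lower=10.0**-2, upper=10.0**3)
print("alpha ~", round(alpha, 6))
print("beta  ~", round(beta, 6))

# Round trip: the boundaries are recovered from the implied error rates.
lower, upper = wald_boundaries(alpha, beta)
print("log10(A) =", round(math.log10(lower), 6), " log10(B) =", round(math.log10(upper), 6))
```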
P(y|g) depends on the penetrance parameters f = (f0, f1, f2).
P(g|θ) depends on disease and marker allele frequencies.
Ex: G1 in the recessive example: (1A|2a, 3A|4a)
P(g|θ) is the product of
  2pq * 2p1p2 for the father
  2pq * 2p3p4 for the mother
  θ^2/4 for affected daughter 3
  θ^2/4 for affected daughter 4

P(g|θ)
• P(y|g): genetic model
• $P(g \mid \theta) = \prod_i P(g_i)\, \prod_j P(g_j \mid g_{F_j}, g_{M_j})$
  – i indexes founders
  – j indexes non-founders
  – Genotypes g include those of marker and disease genes
  – Missing data, multilocus markers…

More on missing marker data
• Good estimates of the allele frequencies are necessary
• Assuming a uniform allele frequency distribution is usually not a good idea
  – Bias
  – See e.g. Ott (1999)
• Allele frequencies for markers are available on websites.
• Genotype, say, 50 unrelated controls from the same population
  – It is also possible to use alleles from individuals in the study without introducing bias.

Heterogeneity
• Allelic heterogeneity
  – Ex: Different mutations in BRCA1 will lead to the same phenotype
• Genetic heterogeneity
  – Only a proportion of the families in a study can be explained by one disease locus.
  – Test for heterogeneity
    • Smith (1963): the admixture test
    • Implemented in HOMOG (a program in the LINKAGE package)
    • Estimates the proportion of linked families

Age-dependent penetrance contd.
Assume that a 45 year old woman comes to the clinic. What are the odds that she is a disease gene carrier?
Odds of being a disease gene carrier in different age bands:
  Age     Odds
  <30     1:2
  30-39   1:3
  40-49   1:8
  50-59   1:12
  60-69   1:27
  70-79   1:36
Penetrance in the 40-49 band if aa: 0.0012, if Aa: 0.0235
Odds: 0.0235 : 150*0.0012, i.e. about 1:8

General pedigrees
• The Elston-Stewart algorithm (1971)
  – Start at the bottom of the pedigree and solve the problem for each nuclear family.
  – The likelihood for each branch is "peeled" onto the individual linking the sub-tree to the remaining part of the pedigree

Two-point vs. Multipoint Linkage
• Two-point linkage analysis
  – Analyze marker-disease co-segregation one locus at a time
    • One two-point lod score for each marker
    • IBS-sharing of a marker allele might lead to false positive lod scores; if possible, look at haplotypes.
• Multipoint (often sliding n-point)
  – Regard the marker positions as fixed
  – Vary the location (x) of the disease locus across each sub-map of n adjacent markers.
  – Compare each multilocus likelihood to a likelihood corresponding to "x off the map" (θ = 0.5).

Software
• Jurg Ott's website at Rockefeller University
  – http://linkage.rockefeller.edu/soft
• For parametric linkage analysis
  – LINKAGE
  – FASTLINK
  – VITESSE

Linkage Analysis II -- Nonparametric

IBS or IBD
Example: affected sibs with marker genotypes 1 4 and 4 2.
The affected sibs have one allele in common (4), but the 4-alleles come from different parents.
Definition: Two alleles are said to be identical by state (IBS) if they are of the same kind. If two alleles have the same ancestral origin they are said to be identical by descent (IBD).
IBS-count: 1
IBS is a weaker concept than IBD.

Notation
x = a fixed locus on the genome
N = N(x) = the number of alleles shared IBD by an affected sib pair at locus x
Let us first assume that x is the disease locus.

ASP linkage analysis
• Collect affected sib pairs
  – How many depends on the genetic effect
  – Power calculations
• Genotype all 4 members of each pedigree
• Estimate the conditional IBD probabilities (z0, z1, z2)
• Compare with the IBD probabilities under the null hypothesis of no linkage: z_H0 = (0.25, 0.5, 0.25) (binomial)

P(N = k), k = 0, 1, 2?
Possible parental disease locus genotypes
(parent 1 genotype in rows, parent 2 genotype in columns)
          aa        Aa        AA
  aa    aa, aa    aa, Aa    aa, AA
  Aa    Aa, aa    Aa, Aa    Aa, AA
  AA    AA, aa    AA, Aa    AA, AA

The corresponding genotype probabilities under the assumption of HWE and independence between the parents are:
$\begin{pmatrix} q^2 \\ 2pq \\ p^2 \end{pmatrix} \begin{pmatrix} q^2 & 2pq & p^2 \end{pmatrix} = \begin{pmatrix} q^4 & 2pq^3 & p^2q^2 \\ 2pq^3 & 4p^2q^2 & 2p^3q \\ p^2q^2 & 2p^3q & p^4 \end{pmatrix}$
This matrix is symmetric, so it is sufficient to consider 6 different mating types.

P(N = k), k = 0, 1, 2
  Mating type         P(Ci)
  C1    aa, aa        q^4
  C2    Aa, aa        4pq^3
  C3    Aa, Aa        4p^2q^2
  C4    AA, aa        2p^2q^2
  C5    AA, Aa        4p^3q
  C6    AA, AA        p^4

$P(N = 0 \mid \text{2 aff sibs}) = \dfrac{P(\text{2 aff sibs} \mid \text{IBD} = 0)\, P(\text{IBD} = 0)}{P(\text{2 aff sibs})}$, with $P(\text{IBD} = 0) = 0.25$

Before we go on, remember the genetic model: recessive disease with f = (0, 0, 1).
$P(\text{2 aff sibs} \mid \text{IBD} = 0) = \sum_{i=1}^{6} P(\text{2 aff sibs} \mid \text{IBD} = 0, C_i)\, P(C_i) = 1 \cdot p^4 = p^4$
Why? Because both affected sibs must have 2 disease alleles and these pairs of alleles must be of different parental origin. Thus P(2 aff sibs | IBD = 0, Ci) = 0 for i = 1-5.
Finally we calculate the denominator P(2 aff sibs).

IBD probabilities for a few genetic models
Table 2.1, page 30 in the compendium
λs = sibling relative risk = 0.25/z0 (strength of the genetic component)

The Maximum Lod Score (MLS)
Assumptions:
• n affected sib pairs
• a marker at a specific test locus x has been genotyped
• perfect marker information (N = N(x) known)
Null hypothesis H0: Ψ = (z0, z1, z2) = (1/4, 1/2, 1/4) = (0.25, 0.5, 0.25)
Alternative H1: Ψ = (z0, z1, z2) ≠ (0.25, 0.5, 0.25) (a fixed alternative)
Pedigree number i: Ni = 2
The support for the alternative hypothesis is
$LR_i(x; \Psi) = \dfrac{P(N_i = 2 \mid H_1)}{P(N_i = 2 \mid H_0)} = \dfrac{z_2}{0.25} = 4 z_2$
Ex: LR = 4 at the disease locus if z2 = 1 (recessive disease with full penetrance and no phenocopies)

MLS continued
$LR_i(x; \Psi) = \dfrac{P(N_i = j \mid H_1)}{P(N_i = j \mid H_0)} = \begin{cases} z_0 / 0.25 = 4 z_0 & \text{if } j = 0 \\ z_1 / 0.5 = 2 z_1 & \text{if } j = 1 \\ z_2 / 0.25 = 4 z_2 & \text{if } j = 2 \end{cases}$
Note: Both the observed IBD-count (j) and the IBD-probabilities Ψ depend on x.

n affected sib pairs
# 0 IBD = n0 = n0(x)
# 1 IBD = n1 = n1(x)
# 2 IBD = n2 = n2(x)
Combined evidence in favor of H1:
$LR(x; \Psi) = LR_1(x; \Psi) \cdot LR_2(x; \Psi) \cdots LR_n(x; \Psi) = (4 z_0)^{n_0} (2 z_1)^{n_1} (4 z_2)^{n_2}$
The LOD score (base 10):
$Z(x; \Psi) = \log_{10}\!\left((4 z_0)^{n_0} (2 z_1)^{n_1} (4 z_2)^{n_2}\right) = n_0 \log_{10}(4 z_0) + n_1 \log_{10}(2 z_1) + n_2 \log_{10}(4 z_2)$

MLS continued
The maximum lod score $\max_{\Psi} Z(x; \Psi)$ is known as the MLS-score.
The Ψ corresponding to the MLS-score is the maximum likelihood estimate $\hat{\Psi}$ of Ψ:
$\hat{\Psi} = (n_0/n,\; n_1/n,\; n_2/n)$, the relative frequencies.
Constrained maximization over Holmans' triangle leads to increased power.
The derivation is more complicated under incomplete marker information.
The MMLS-score is defined as the maximum of the MLS-scores over x.

NPL Score
• Example: half sib pair
$X_{ij,t}$: indicator that the i-th pair shares j alleles IBD at locus t
$X_{1,t} = \sum_i X_{i1,t}$, λ = recombination rate, τ = trait locus
$P(X_{i1,t} = 1 \mid \text{affected half sibs}) = \left(1 + \alpha\, e^{-2\lambda |t - \tau|}\right)/2$, where α stands for the effect-size parameter (its symbol is missing in the source)
Log-likelihood at the trait locus $= X_{1,\tau} \log(1 + \alpha) + (N - X_{1,\tau}) \log(1 - \alpha)$
The score statistic for testing H0: α = 0 is $X_{1,\tau}$.
For τ unknown, we use $\max_t Y_t$, with $Y_t = X_{1,t}$.
Remark: $Y_t$ is a Markov chain.

The NPL Score
NPL = Non Parametric Linkage
Before we define the score let us repeat the definitions of expectation and variance:
Expectation: $\mu_N = E(N) = \sum_{k=0}^{2} k\, P(N = k)$
Ex: $E(N) = 0 \cdot z_0 + 1 \cdot z_1 + 2 \cdot z_2 = z_1 + 2 z_2$
Variance: $V(N) = E\!\left((N - \mu_N)^2\right) = E\!\left(N^2 + \mu_N^2 - 2 N \mu_N\right) = E(N^2) - E(N)^2$
Ex: $V(N) = 0 \cdot z_0 + 1 \cdot z_1 + 4 \cdot z_2 - (z_1 + 2 z_2)^2$

Under the null hypothesis of no linkage
Under H0: (z0, z1, z2) = (0.25, 0.5, 0.25)
$E(N) = z_1 + 2 z_2 = 0.5 + 2 \cdot 0.25 = 1$
$V(N) = E(N^2) - E(N)^2 = z_1 + 4 z_2 - 1^2 = 0.5 + 4 \cdot 0.25 - 1 = 0.5$

The NPL score continued
Definition: $SD(N) = \sigma_N = \sqrt{V(N)}$, the standard deviation of N
Under H0: $SD(N) = \sqrt{0.5}$
Standardization: $Z = \dfrac{N - \mu_N}{\sigma_N}$ has expectation 0 and standard deviation 1.
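The expectation, variance and standardization of the IBD count N can be checked numerically. The sketch below also contrasts H0 with the fully penetrant recessive model used earlier. For that model it assumes P(2 aff sibs | IBD = k) = p^(4-k), which extends the p^4 argument given for IBD = 0 (a pair sharing k alleles IBD carries 4 - k distinct parental alleles, all of which must be A). That generalization and the function names are assumptions of this sketch, not taken from the slides.

```python
import math

H0 = (0.25, 0.5, 0.25)  # IBD probabilities (z0, z1, z2) under no linkage

def ibd_moments(z):
    """Mean and variance of the IBD count N for probabilities z = (z0, z1, z2)."""
    mean = z[1] + 2.0 * z[2]
    variance = z[1] + 4.0 * z[2] - mean**2
    return mean, variance

def standardize(n_shared, z=H0):
    """Z = (N - mu_N) / sigma_N, with moments taken under z (H0 by default)."""
    mean, variance = ibd_moments(z)
    return (n_shared - mean) / math.sqrt(variance)

def recessive_ibd_probs(p):
    """(z0, z1, z2) at the disease locus for f = (0, 0, 1) and allele frequency p,
    via Bayes' rule with the assumed P(2 aff sibs | IBD = k) = p**(4 - k)."""
    joint = [H0[k] * p**(4 - k) for k in range(3)]
    total = sum(joint)            # P(2 aff sibs), the denominator
    return tuple(j / total for j in joint)

print(ibd_moments(H0))                        # mean 1, variance 0.5 under H0
print([round(standardize(k), 4) for k in (0, 1, 2)])

z_rec = recessive_ibd_probs(0.05)
print([round(z, 4) for z in z_rec])           # most of the mass sits on IBD = 2
print("sibling relative risk 0.25/z0 =", round(0.25 / z_rec[0], 2))
```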
For the i-th sib pair define the NPL family score:
$Z_i = \begin{cases} -\sqrt{2} & \text{with probability } z_0 \\ 0 & \text{with probability } z_1 \\ \sqrt{2} & \text{with probability } z_2 \end{cases}$
Note: $Z_i = \dfrac{N_i - 1}{\sqrt{0.5}} = \sqrt{2}\,(N_i - 1)$
E(Zi) = 0 under H0
E(Zi) > 0 under H1

The NPL score at a locus x
$Z(x) = \dfrac{1}{\sqrt{n}} \sum_{i=1}^{n} Z_i(x) = \sqrt{\dfrac{2}{n}}\,\bigl(n_2(x) - n_0(x)\bigr)$
Properties:
• E(Z(x)) = 0 under H0
• V(Z(x)) = 1 under H0
• Large NPL scores lead to rejection of H0
• E(Z(x)) > 0 under H1
• E(Z(x)) increases with the sample size under H1
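The two affected-sib-pair statistics from this part can be computed from the same IBD counts (n0, n1, n2). The sketch below assumes perfect marker information and uses the unconstrained MLS, with the relative frequencies as estimates and no restriction to Holmans' triangle. The counts 10, 50, 40 are made-up illustration data and the function names are hypothetical.

```python
import math

def npl_score(n0, n1, n2):
    """NPL score Z(x) = sqrt(2/n) * (n2 - n0) for n = n0 + n1 + n2 sib pairs."""
    n = n0 + n1 + n2
    return math.sqrt(2.0 / n) * (n2 - n0)

def mls_score(n0, n1, n2):
    """Unconstrained maximum lod score: the lod evaluated at the relative
    frequencies (n0/n, n1/n, n2/n) against the null (0.25, 0.5, 0.25)."""
    n = n0 + n1 + n2
    null = (0.25, 0.5, 0.25)
    score = 0.0
    for count, p_null in zip((n0, n1, n2), null):
        if count > 0:  # a zero count contributes nothing (0 * log 0 taken as 0)
            score += count * math.log10((count / n) / p_null)
    return score

# Made-up counts for 100 affected sib pairs at one locus.
n0, n1, n2 = 10, 50, 40
print("NPL =", round(npl_score(n0, n1, n2), 4))
print("MLS =", round(mls_score(n0, n1, n2), 4))

# Counts that match the null proportions exactly give 0 for both statistics.
print(npl_score(25, 50, 25), mls_score(25, 50, 25))
```

The NPL score is read against a standard normal scale under H0, while the MLS is a lod score, so the two numbers are not directly comparable even when they happen to be close.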