Gene mapping by association
3/4/04
Biomath/HG 207B/Biostat 237

Linkage has its limits
To determine that a trait is closer to marker 1 than to marker 2, we need to see recombination between marker 2 and the trait locus. As the distance between the markers decreases, the number of informative meioses needed to see a recombination increases. At some point linkage analysis becomes impractical because too many families are needed.

Association Studies
• Association is a statistical term that describes the co-occurrence of alleles or phenotypes.
• An allele A is associated with disease D if people with D have a different frequency of A than people without D.

Possible causes for allelic association
• best: allele increases disease susceptibility – candidate gene studies
• good: some subjects share a common ancestor – linkage disequilibrium studies

Under linkage equilibrium, P(A,D) = P(A) P(D). Violation of this equality is termed linkage disequilibrium.
[Diagram: loci M and K with alleles A1 and D]

Linkage Disequilibrium
Suppose one of the population founders carries an allelic variant that increases risk of a disease. The disease gene is very close to a marker, so the recombination fraction $\theta$ is very small.
[Figure: founder haplotypes]
Ancestral haplotypes are dA, da, and Da. Note that D is associated with a: P(a|D) is close to one.

Over many generations (n), there is occasionally recombination between the two genes, so that the population looks like:
[Figure: population haplotypes after n generations]
The degree of association between D and a has decreased: P(a|D) is smaller, but still P(a|D) > P(a), that is, P(aD) > P(a)P(D).

The Degree of Association Between Two Genes Depends on the Distance Between Them and the Age of the Population
1. Let $\delta_{aD} = P(aD) - P(a)P(D)$, and similarly for other alleles. Then
$\delta_{aD}(n) = \delta_{aD}(0)\,(1-\theta)^n$
2. At linkage equilibrium
P(a/a|D/d) = P(a/a|d/d) = P(a/a|D/D) = P(a/a)
P(A/a|D/d) = P(A/a|d/d) = P(A/a|D/D) = P(A/a)
P(A/A|D/d) = P(A/A|d/d) = P(A/A|D/D) = P(A/A)
Violation of these equalities is evidence of linkage disequilibrium.
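To make the decay relation $\delta_{aD}(n) = \delta_{aD}(0)(1-\theta)^n$ concrete, here is a minimal sketch. The function names and the numerical inputs (an initial disequilibrium of 0.05, recombination fractions of 0.01 and 0.2, 100 generations) are illustrative assumptions, not values from the lecture.

```python
import math


def ld_after_generations(delta0: float, theta: float, n: int) -> float:
    """Return delta(n) = delta(0) * (1 - theta)**n."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    return delta0 * (1.0 - theta) ** n


def generations_to_halve(theta: float) -> float:
    """Generations needed for the disequilibrium to fall to half its initial value."""
    if theta <= 0.0:
        return math.inf  # no recombination: the association never decays
    return math.log(0.5) / math.log(1.0 - theta)


if __name__ == "__main__":
    # Hypothetical comparison: tight linkage versus loose linkage.
    for theta in (0.01, 0.2):
        remaining = ld_after_generations(0.05, theta, 100)
        print(f"theta={theta}: delta(100)={remaining:.6f}, "
              f"half-life={generations_to_halve(theta):.1f} generations")
```

The comparison illustrates the slide's point: association persists over many generations only when the marker is very close to the disease gene.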
Allelic association studies test whether alleles are associated with the trait
• 2 types of association tests
– population-based association tests
  • cases and controls are unrelated
  • cross-classify by genotype
  • use a $\chi^2$ test or logistic regression
– family-based association tests
  • cases and controls are related: parents, sibs, etc.
  • often based on allele transmission rates
  • prime example: the TDT

Mapping Genes using a Case Control Design
1. Example: Non-insulin-dependent diabetes in Pima Indians is associated with the human immunoglobulin gene, Gm (Knowler et al., 1988).

            Gm genotype
            1/1 or 1/2     2/2            Total
Cases       23 (.0169)     1343 (.983)    1366
Controls    270 (.0760)    3284 (.924)    3554
Total       293            4627           4920

$\chi^2$ = 61.6, p < 0.00005.
2. What can go wrong? The association could be due to ethnic differences between cases and controls: population stratification.

A Dramatic Example of when Association is due to Population Stratification
1. The Gm genotype differs by degree of Caucasian heritage

Heritage    1/1 or 1/2     2/2     Total
>50%        184 (.441)     233     417
<50%        109 (.0242)    4394    4503
Total       293            4627    4920

$\chi^2$ = 1185.5, p < 0.00005
2. Diabetes prevalence differs by Caucasian heritage

Diabetes    >50%           <50%    Total
Yes         20 (.0146)     1346    1366
No          397 (.112)     3157    3554
Total       417            4503    4920

3. Controlled for age and degree of Caucasian background, diabetes and Gm are not significantly associated (Knowler et al., 1988).

How concerned should we be about population stratification invalidating case/control results?
1. Allele frequencies and disease prevalence rarely differ as dramatically by race as in this example.
2. Good epidemiological methods can reduce the problem: collect information on racial/ethnic background.
3. Sometimes there is no alternative to a case/control design. Family controls may not be available.
On the other hand,
1. Better safe than sorry: use family-based control designs.
2. Family-based designs require more genotyping but not more phenotyping than case/control designs.

The Transmission Disequilibrium Test eliminates concern over false positives due to population stratification
Spielman et al., 1993
Terwilliger and Ott, 1992
A simple illustration of the TDT: collect parent-child trios.
If the child is chosen without regard to disease status (parents A/a and A/A), then the child's genotype is equally likely to be A/a or A/A.
However, if the child is chosen because they are affected, and the marker allele a is associated with the disease allele D, then the child is more likely to have the A/a genotype at the marker than the A/A genotype.
[Figure: the heterozygous parent carries haplotypes dA and Da, the other parent is dA/dA, and the affected child inherits Da]

Testing for Transmission Distortion (Disequilibrium)
A biallelic locus
1. Select individuals with the disease; genotype these individuals as well as their parents.
2. Determine how many heterozygous parents transmit A and how many transmit a.
3. Under the null hypothesis, the probability that a parent with the A/a genotype transmitted an A is ½.
4. Also under the null hypothesis, the maternal and paternal transmissions are independent.
5. In the case where there are only two alleles at the marker, the test is equivalent to a McNemar test.

               Not transmitted
Transmitted    A       a
A              C11     C12
a              C21     C22

Test statistic: $T = (C_{21} - C_{12})^2 / (C_{21} + C_{12})$
For large samples and under the null hypothesis, T has a chi-square distribution (df = 1).

What are we testing with the TDT?
A strongly positive result suggests that the marker tested is a trait susceptibility locus or that the marker is closely linked to a trait susceptibility locus.
$H_0: P(A/a \to a) = P(A/a \to A) = \tfrac{1}{2}$
$H_{Alt}: P(A/a \to a) \ne \tfrac{1}{2}$
For a single affected child per family, the null and alternative hypotheses are equivalent to:
$H_0: \theta = \tfrac{1}{2}$ or $\delta = 0$
$H_{Alt}: \theta < \tfrac{1}{2}$ and $\delta \ne 0$
When more than one affected child per family is used, the TDT confounds linkage and association.
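The two 1-df chi-square statistics used so far, the case/control cross-classification and the biallelic TDT, can be sketched as follows. The Pima counts are the ones tabulated above (Knowler et al., 1988). The function names and the transmission counts 30 and 18 are hypothetical, and the exact two-sided binomial p-value uses the tail-doubling convention, which is an assumption on my part rather than something the slides specify.

```python
import math


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def pearson_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def tdt_statistic(n_first: int, n_second: int) -> float:
    """McNemar-type TDT: heterozygous parents transmitting one allele vs. the other."""
    return (n_first - n_second) ** 2 / (n_first + n_second)


def tdt_exact_p(n_first: int, n_second: int) -> float:
    """Two-sided exact p-value from Binomial(n, 1/2), doubling the larger tail."""
    n, k = n_first + n_second, max(n_first, n_second)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Case/control table for Gm and diabetes (cases row, then controls row).
    x = pearson_2x2(23, 1343, 270, 3284)
    print(f"case/control chi-square = {x:.1f}, p = {chi2_sf_1df(x):.2e}")

    # Hypothetical TDT counts: 30 heterozygous parents transmit a, 18 transmit A.
    t = tdt_statistic(30, 18)
    print(f"TDT = {t:.2f}, asymptotic p = {chi2_sf_1df(t):.4f}, "
          f"exact p = {tdt_exact_p(30, 18):.4f}")
```

The first statistic comes out at about 61.6, the value quoted for the Pima example. Note that the TDT uses only the off-diagonal cells, so parents who are homozygous at the marker never enter the calculation.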
Because of this confounding, little is gained by running the TDT on a data set consisting of several very large pedigrees if linkage of the trait and marker has already been established. With many small unrelated pedigrees, information on association can still be gained.

The TDT has been extended to multiple alleles per locus
Ho: transmission to an affected child does not depend on allele type
Ha: transmission to an affected child depends on allele type

                    Allele transmitted
Not transmitted     1         2         ...    k-1        k          Total
1                   ----      C1,2      ...    C1,k-1     C1,k       n1
2                   C2,1      ----      ...    C2,k-1     C2,k       n2
...                 ...       ...       ...    ...        ...        ...
k-1                 Ck-1,1    Ck-1,2    ...    ----       Ck-1,k     nk-1
k                   Ck,1      Ck,2      ...    Ck,k-1     ----       nk
Total               t1        t2        ...    tk-1       tk

$t_i$ represents the column sum omitting the diagonal term, $n_i$ the row sum also omitting the diagonal.
Test statistics include
$T_{mh} = \frac{k-1}{k} \sum_{i=1}^{k} \frac{(t_i - n_i)^2}{t_i + n_i}$
(Mendel's TDT1 is proportional to this statistic) and
$TDT_2 = \max_{1 \le i \le k} \frac{(t_i - n_i)^2}{t_i + n_i}$
Under some conditions, $T_{mh}$ is asymptotically distributed as chi-square with k-1 degrees of freedom.

Numerical example: data from a locus with 5 alleles; 120 transmissions from heterozygous parents to affected children.

               transmitted
not-trans.     1     2     3     4     5     n
1              --    6     4     4     5     19
2              6     --    5     7     4     22
3              8     7     --    7     5     27
4              8     5     5     --    6     24
5              7     8     7     6     --    28
t              29    26    21    24    20    120

Tmh = ? TDT2 = ?
Is there evidence of transmission distortion?

MENDEL determines significance using permutation procedures
Why? If the sample size is small or alleles are rare, the distribution of the TDT statistic is poorly approximated by a chi-square distribution.
How?
(1) For each iteration (usually 10,000 or more)
(a) Calculate a new TDT table. Hold the parental genotypes fixed. For each child, designate with equal probability which of the parental alleles the child receives.
(b) Calculate the TDT statistic and determine whether it is larger than the observed TDT statistic.
(2) The p-value is equal to the number of iterations in which the TDT statistic is larger than the observed one, divided by the total number of iterations.
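The multi-allele statistics and the permutation procedure above can be sketched as follows. This is an illustration under stated assumptions, not MENDEL's implementation: the count matrix is my reading of the flattened five-allele table (rows are the allele not transmitted, columns the allele transmitted), the function names are hypothetical, and ties with the observed statistic are counted as exceedances, which is slightly more conservative than the "larger than" rule on the slide.

```python
import random

# COUNTS[i][j]: heterozygous i/j parents who did NOT transmit allele i+1
# and DID transmit allele j+1 (diagonal unused). Assumed table layout.
COUNTS = [
    [0, 6, 4, 4, 5],
    [6, 0, 5, 7, 4],
    [8, 7, 0, 7, 5],
    [8, 5, 5, 0, 6],
    [7, 8, 7, 6, 0],
]


def margins(c):
    """Return (t, n): column sums and row sums, both omitting the diagonal."""
    k = len(c)
    t = [sum(c[i][j] for i in range(k) if i != j) for j in range(k)]
    n = [sum(c[i][j] for j in range(k) if j != i) for i in range(k)]
    return t, n


def tdt_stats(c):
    """Return (Tmh, TDT2) for a k-allele transmission table."""
    k = len(c)
    t, n = margins(c)
    terms = [(t[i] - n[i]) ** 2 / (t[i] + n[i]) for i in range(k) if t[i] + n[i] > 0]
    return (k - 1) / k * sum(terms), max(terms)


def permuted_table(c, rng):
    """Redraw every transmission from each i/j heterozygous parent with probability 1/2."""
    k = len(c)
    new = [[0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            m = c[i][j] + c[j][i]  # parental genotypes (i/j) are held fixed
            x = bin(rng.getrandbits(m)).count("1") if m else 0  # Binomial(m, 1/2)
            new[i][j], new[j][i] = x, m - x
    return new


def permutation_p(c, iterations=10000, seed=0):
    """Monte Carlo p-value for Tmh and its standard error."""
    rng = random.Random(seed)
    observed = tdt_stats(c)[0]
    hits = sum(tdt_stats(permuted_table(c, rng))[0] >= observed
               for _ in range(iterations))
    p = hits / iterations
    return p, (p * (1.0 - p) / iterations) ** 0.5


if __name__ == "__main__":
    tmh, tdt2 = tdt_stats(COUNTS)
    p, se = permutation_p(COUNTS)
    print(f"Tmh = {tmh:.3f}, TDT2 = {tdt2:.3f}, permutation p = {p:.3f} (s.e. {se:.3f})")
```

With the table read this way, Tmh is 3.6 and TDT2 is 100/48, about 2.08. The standard error returned alongside the p-value is the binomial sampling error of the Monte Carlo estimate, which shrinks as the number of iterations grows.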
What is the reason for the standard error? Permutation p-values are estimated using Monte Carlo simulation with a finite number of iterations.

TDT Summary
• ignores transmissions from homozygous parents
• with two alleles it has an approximate chi-square(1) distribution (McNemar test)
– but exact p-values can be computed from the Binomial(p = .5) distribution in the biallelic case
• If there is one affected per nuclear family, this tests the null: no linkage or no association
– If the test is significant, there is linkage and association
• If there are multiple affecteds, the TDT will confound linkage and association owing to the dependencies of the trios.
– users should not expect new insight when the data consist of one or two large disease pedigrees already showing linkage
– with many small unrelated pedigrees, the chance of confusing linkage with association becomes less of an issue, and the TDT can help in identifying associated marker alleles.

Limitations of the original TDT
(1) Nuclear families
(2) Qualitative traits
(3) Codominant markers
Many methods for extending the TDT have been developed. We will discuss one in detail, the gamete competition model.

One way to extend the TDT:
Lange (1988), Jin et al. (1994), and Sham and Curtis (1995) considered a model (Bradley and Terry, 1952) that was originally used to rank teams and to predict the outcome of team sports.
How does the model work? Look at a specific example:
Suppose we are interested in predicting the outcome of a playoff game where the Diamondbacks play the Dodgers. Or suppose we want to know the probability that the Dodgers will be the National League West winners this year, given the regular season results for last year.
Suppose the results are:

             Winner
Loser        D'backs   Dodgers   Giants   Rockies   Padres
D'backs      ---       6         4        4         5
Dodgers      6         ---       7        5         4
Giants       8         5         ---      5         6
Rockies      8         7         7        ---       5
Padres       7         8         6        7         ---

Let D'backs/Dodgers → Dodgers denote the event that the D'backs and Dodgers play and the Dodgers win.
In general, for each team i we assign a win parameter $t_i$ so that the probability that i beats j is
$P(i/j \to i) = \frac{t_i}{t_i + t_j}$

Bradley-Terry Model of Competing Sports Teams
Note that multiplying every $t_i$ by any a > 0 does not change this probability, so one $t_i$ can be fixed at 1. We fix $t_{d'backs} = 1$.
Note that if $t_i > t_j$ for all j, then i is the best team.
Let $y_{ij}$ denote the number of times that i plays j and i wins. For example, the D'backs beat the Giants 8 times and the Giants beat the D'backs 4 times ($y_{ij} = 8$ and $y_{ji} = 4$).
The win parameters can be determined using the following recurrence relationship
$t_i^{m+1} = \frac{\sum_{j \ne i} y_{ij}}{\sum_{j \ne i} \frac{y_{ij} + y_{ji}}{t_i^{m} + t_j^{m}}}$
where the loglikelihood is
$\ln L = \sum_{i} \sum_{j \ne i} y_{ij} \left[ \ln t_i - \ln(t_i + t_j) \right]$

RESULTS
Ho: all teams are equally likely to win ($t_i = 1$ for all i), so that $P(i/j \to i) = \tfrac{1}{2}$.
LRT = 3.63; the p-value of 0.46 gives no reason to reject the null hypothesis.
We get more information from this analysis: we get the relative rankings.
$t_{dodgers}$ = 1.23, $t_{d'backs}$ = 1.00, $t_{giants}$ = 0.87, $t_{rockies}$ = 0.71, $t_{padres}$ = 0.67
With these rankings we can make predictions about the outcomes of games:
P(dodgers/d'backs → dodgers) = 1.23/2.23 = .55
P(giants/dodgers → dodgers) = 1.23/2.10 = .59
Note that these probabilities are different from the predictions we would make using only the individual match-up records. The estimate is not 8/12 = .67 for the Dodgers beating the Giants.

How is this sports analysis analogous to the TDT?
Think of:
(1) Each possible allele at the locus = a team
(2) A heterozygous parent = a match-up
(3) Allele received by the child from a heterozygous parent = the winner of the game
(4) The transmission parameters = the win parameters
(5) The win/loss record is determined by the transmissions from heterozygous parents.

               transmitted
not trans.     1     2     3     4     5
1              --    6     4     4     5
2              6     --    7     5     4
3              8     5     --    5     6
4              8     7     7     --    5
5              7     8     6     7     --

When we ignore disease status, the Bradley-Terry model provides a form of segregation analysis. When we consider the transmissions to affected members only (as in this example), we have a form of TDT analysis.

The gamete competition likelihood for a pedigree
The general form of the gamete competition likelihood for a pedigree with n individuals is
$L = \sum_{G_1} \cdots \sum_{G_n} \prod_{i} Pen(X_i \mid G_i) \prod_{j} Prior(G_j) \prod_{\{k,l,m\}} Tran(G_m \mid G_k, G_l)$
Here person i has marker phenotype $X_i$ and underlying marker genotype $G_i$.
For founders j, Prior($G_j$) is the prior probability of genotype $G_j$.
For offspring m with parents k and l, the transmission probability factors as
Tran($G_m$ | $G_k$, $G_l$) = Tran($G_{mk}$ | $G_k$) * Tran($G_{ml}$ | $G_l$)
Tran($G_{mk}$ | $G_k$) = $t_{mk}/(t_{mk} + t_{nk})$ and Tran($G_{ml}$ | $G_l$) = $t_{ml}/(t_{ml} + t_{nl})$
The penetrance Pen($X_i$ | $G_i$) is always 1 or 0, depending on whether $X_i$ and $G_i$ are consistent or inconsistent.

Assessing significance
We use a likelihood ratio test statistic
LRT = 2 * ( ln($L_{Ha}$) - ln($L_{Ho}$) )
where $L_{Ha}$ and $L_{Ho}$ are the maximum likelihoods under the alternative and null hypotheses.
Significance? Approximate p-values can be calculated by assuming the distribution is chi-square or by gene dropping.

Gamete Competition contrasted with the TDT
(1) Gamete competition works on extended pedigrees. There is no need to break up large families into nuclear families.
(2) If we have only trios, the gamete competition and the TDT are equivalent. Their null hypothesis is no linkage or no association. The alternative hypothesis is linkage and association.
(3) When considering more than one affected per family, the TDT and gamete competition confound association with linkage.
(4) Exact p-values can be determined with the TDT. Gamete competition p-values are asymptotic.
(5) The gamete competition model can be used when there is missing marker information. Allele frequencies can be fixed at population estimates or estimated along with the t's.
(6) When there is missing data, the gamete competition is not immune to the effects of population stratification or rare alleles.

Example: Families affected with non-insulin-dependent diabetes and linkage to a marker within the sulfonylurea receptor-1 gene (SUR)
27 Mexican-American extended pedigrees with 74 affected offspring (all genotyped) at SUR
The likelihood ratio test statistic is 9.133 with 9 degrees of freedom. P-value = 0.043

allele    freq    t_i     s.e. of t_i
1         .054    .288    .215
2         .210    1.00    fixed
3         .190    .810    .447
4         .048    1.40    .985
5         .047    .697    .681
6         .108    .383    .204
7         .140    .556    .288
8         .091    .567    .322
9         .071    .499    .509
10        .042    .082    .104

Can this model be extended to quantitative traits?
Yes, by recognizing that the Bradley-Terry model is equivalent to a matched case-control design. The transmitted allele is the case; the untransmitted allele is the control.
$t_i = e^{\omega_i x_p}$
where $x_p$ denotes child p's standardized trait value, i denotes allele i, and the probability of an i/j heterozygous parent transmitting i is
$P(i/j \to i) = \frac{e^{(\omega_i - \omega_j) x_p}}{e^{(\omega_i - \omega_j) x_p} + 1}$
Note that one $\omega$ is set to zero. This is equivalent to conditional logistic regression.

Quantitative Trait Example: ACE
High ACE concentration is associated with a deletion within an intron of the ACE gene. 404 people in 69 families (Sinsheimer et al., 2000).
$P(\text{insertion/deletion} \to \text{deletion}) = \frac{e^{\omega_{deletion} x_k}}{e^{\omega_{deletion} x_k} + 1}$
$P(\text{insertion/deletion} \to \text{insertion}) = \frac{1}{e^{\omega_{deletion} x_k} + 1}$
P(insertion/deletion → deletion) + P(insertion/deletion → insertion) = 1.0

              mle     s.e. of mle
insertion     0.00    fixed
deletion      1.31    0.17

Ho: $\omega_{deletion} = 0$
Ha: $\omega_{deletion} \ne 0$
LRT = 82.76. Asymptotic p-value < 1 x 10^-19

Another Example: Analyzing tightly linked SNPs
SNPs (single nucleotide polymorphisms) tend to be more stable and more abundant than microsatellite markers. They are predominantly biallelic, so we would like to use several tightly linked markers simultaneously to increase the overall information content.
Recall that we use the allele transmissions from heterozygous parents.
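The transmission parameters in the gamete competition examples are Bradley-Terry win parameters, so it may help to see the recurrence from the sports example written out. The sketch below is a minimal illustration, not MENDEL's implementation: the function names are hypothetical, the count table is my reading of the flattened five-by-five win/loss (equivalently, transmitted/not-transmitted) table, and the choice of the second team or allele as the reference is an assumption made so the scale is comparable to the reported estimates.

```python
import math

# TABLE[r][c]: number of times c beat r (rows = loser / not transmitted,
# columns = winner / transmitted). Assumed reading of the flattened table.
TABLE = [
    [0, 6, 4, 4, 5],
    [6, 0, 7, 5, 4],
    [8, 5, 0, 5, 6],
    [8, 7, 7, 0, 5],
    [7, 8, 6, 7, 0],
]


def wins_matrix(table):
    """y[i][j] = number of times i beats j (transpose of the loser-by-winner table)."""
    k = len(table)
    return [[table[j][i] for j in range(k)] for i in range(k)]


def fit_bradley_terry(y, ref=0, iterations=500):
    """Iterate t_i <- (wins of i) / sum_j (y_ij + y_ji) / (t_i + t_j), with t_ref = 1."""
    k = len(y)
    t = [1.0] * k
    for _ in range(iterations):
        new = []
        for i in range(k):
            wins = sum(y[i][j] for j in range(k) if j != i)
            denom = sum((y[i][j] + y[j][i]) / (t[i] + t[j])
                        for j in range(k) if j != i)
            new.append(wins / denom)
        t = [v / new[ref] for v in new]  # rescale so the reference parameter is 1
    return t


def log_likelihood(y, t):
    """ln L = sum over i != j of y_ij * [ln t_i - ln(t_i + t_j)]."""
    k = len(y)
    return sum(y[i][j] * (math.log(t[i]) - math.log(t[i] + t[j]))
               for i in range(k) for j in range(k) if j != i)


def chi2_sf_even_df(x, df):
    """Chi-square upper tail, closed form valid for even degrees of freedom only."""
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


if __name__ == "__main__":
    y = wins_matrix(TABLE)
    t_hat = fit_bradley_terry(y, ref=1)
    lrt = 2.0 * (log_likelihood(y, t_hat) - log_likelihood(y, [1.0] * len(y)))
    print("t estimates:", [round(v, 2) for v in t_hat])
    print(f"LRT = {lrt:.2f}, df = {len(y) - 1}, p = {chi2_sf_even_df(lrt, len(y) - 1):.2f}")
```

Read this way, the table gives a likelihood ratio statistic close to the 3.63 (p about 0.46) reported for the sports example. The fitted parameters, taken relative to the second entry, come out near 1.23, 1.00, 0.87, 0.71 and 0.67, which may indicate that the first two team labels are exchanged between the extracted table and the reported rankings.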
Assuming HWE, the maximum possible proportion of heterozygous parents for a biallelic system is 0.50. For an n-allele system, it is H = (n-1)/n. More alleles, more information.

The phase of these multilocus SNPs may not be known.
Example: suppose there are three SNPs. An individual with multilocus genotype 1/2, 1/2, 1/2 could have one of the following haplotype pairs: (1) 111 and 222, (2) 122 and 211, (3) 121 and 212, or (4) 112 and 221.
The gamete competition allows the use of non-codominant markers, so we don't need to determine which of these haplotype combinations is present in a particular individual.

For two linked loci associated with a quantitative trait, the transmission probability is expressed as:
$P(ij/kl \to ij) = \frac{(1-\theta)\, e^{\omega_{ij} x_p}}{(1-\theta)\left(e^{\omega_{ij} x_p} + e^{\omega_{kl} x_p}\right) + \theta\left(e^{\omega_{il} x_p} + e^{\omega_{kj} x_p}\right)}$
If we are using tightly linked SNPs, then $\theta$ is effectively zero and the transmission probability reduces to:
$P(ij/kl \to ij) = \frac{e^{\omega_{ij} x_p}}{e^{\omega_{ij} x_p} + e^{\omega_{kl} x_p}}$

An Example
Again we use sex-adjusted ACE levels as a quantitative trait.
The three SNPs are labeled by their position and the nucleotides present at the position: A-240T, T1237C, and G2350A. Because the ACE gene spans only 26 kb, the recombination fractions between these SNPs are effectively zero.
The pedigree data consist of 83 white British families ranging in size from 4 to 18 members. ACE levels were determined on 405 family members. Genotypes were collected on 555 family members.

In MENDEL, the most important difference from the previous example will be observed in the locus file. We need to allow for phase ambiguities (lack of certainty in haplotypes).

L469 AUTOSOME 6 27   <- # haplotypes, # phenotypes
ATA 0.40190
ATG 0.00780
ACA 0.06740
ACG 0.18310
T*A 0.01340   !T*A corresponds to haplotypes TTA and TCA
T*G 0.32640   !T*G corresponds to haplotypes TTG and TCG

We are no longer assuming codominant markers, so we must specify the phenotype (of the marker) / genotype relationship. These phenotypes correspond to the marker phenotypes used in the pedigree file.
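The phase ambiguity and the heterozygosity bound discussed above can be illustrated with a short sketch. The function names are hypothetical, and alleles are coded 1 and 2 as in the three-SNP example; the heterozygosity function uses the standard Hardy-Weinberg expression 1 minus the sum of squared allele frequencies.

```python
from itertools import product


def phase_resolutions(genotype):
    """List the unordered haplotype pairs compatible with an unphased genotype.

    genotype: one (allele, allele) pair per SNP, e.g. [(1, 2), (1, 2), (1, 2)].
    """
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        # 'choice' picks which allele of each SNP goes on the first haplotype.
        h1 = "".join(str(g[c]) for g, c in zip(genotype, choice))
        h2 = "".join(str(g[1 - c]) for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def heterozygosity(freqs):
    """Expected proportion of heterozygotes under HWE: 1 - sum of squared frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def max_heterozygosity(n_alleles):
    """Upper bound H = (n - 1) / n, reached when all n alleles are equally frequent."""
    return (n_alleles - 1) / n_alleles


if __name__ == "__main__":
    for pair in phase_resolutions([(1, 2), (1, 2), (1, 2)]):
        print(pair)
    print(max_heterozygosity(2), max_heterozygosity(6))
```

For the triple heterozygote this enumerates the four haplotype pairs listed in the example. Treating the SNPs jointly as one multi-allele system raises the heterozygosity bound above the biallelic 0.50, which is the motivation for analyzing them together.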
RESULTS

Haplotype    Ho p_ijk   s.e. of p_ijk   Ha p_ijk   s.e. of p_ijk   ω_ijk    s.e. of ω_ijk
ATA          .4052      .0257           .4019      .0256           .0000    fixed
ATG          .0079      .0045           .0078      .0024           .2440    .9893
ACA          .0676      .0136           .0674      .0136           .2137    .4076
ACG          .1839      .0199           .1831      .0198           1.169    .2352
TTA+TCA      .0133      .0059           .0134      .0059           .2765    .5848
TTG+TCG      .3321      .0240           .3264      .0242           1.528    .2189

Log-likelihood under Ho = -704.34
Log-likelihood under Ha = -663.73
LRT = 81.22, df = 5, p-value = 4.67 x 10^-16

Many other extensions / alternatives to the TDT have been developed. These include:
TDT using sibling controls
– Sib-TDT (Spielman and Ewens, 1998)
– DAT (Boehnke and Langefeld, 1998)
– SDT (Horvath and Laird)
TDT for quantitative traits
– Allison (1997), Rabinowitz (1997), Abecasis (2000)
Joint modeling of linkage and association that allows estimation of recombination
– Hastabacka (1992)
– Kaplan, Hill and Weir (1995)
– Terwilliger (1995)