Lecture 5: Segregation Analysis I
Date: 9/10/02

Counting number of genotypes, mating types
Segregation analysis: dominant, codominant, estimating segregation ratio
Testing populations: polymorphism, heterogeneity, heterozygosity, allele frequency

Probability: The Need for Permutations and Combinations
Often, particularly in genetics, the sample space consists of all orders or arrangements of groups of objects (usually genes or alleles in genetics). Permutations, combinations, and combinations with repetition exist to handle this elegantly.

Probability: Permutation
Definition: A permutation is an ordered arrangement of r elements chosen from n elements. The number of such arrangements is often written nPr and is calculated as $_nP_r = \frac{n!}{(n-r)!}$.
Example: How many different types of heterozygotes exist when there are l alleles and we distinguish order (e.g. paternal vs. maternal)?

Probability: Combination
Definition: A combination is a selection of r objects from n objects without regard to order. The number of such selections is written nCr and has value $_nC_r = \binom{n}{r} = \frac{n!}{r!(n-r)!}$.
Example: How many different heterozygotes exist without regard to order when there are l types of alleles?

Probability: Combination with Repetition
Definition: Suppose there are n different types of elements and r are selected with replacement; then the number of combinations is given by $C'(n, r) = {}_{n+r-1}C_r$.
Examples: How many genotypes are possible when there are l alleles? How many mating types are possible when there are l alleles?

Review: Segregation Ratio
Recall that the law of segregation states that one of the two alleles of a parent is randomly selected to pass on to the offspring.
Definition: The segregation ratios are the predictable proportions of genotypes and phenotypes in the offspring of particular parental crosses, e.g. 1 AA : 2 AB : 1 BB following a cross of AB x AB.

Segregation Ratio Distortion
Definition: Segregation ratio distortion is a departure from expected segregation ratios.
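The counting questions posed with the permutation and combination definitions above can be checked numerically. The following is a minimal sketch using Python's standard `math` module; the function names are hypothetical and chosen only for illustration, and a mating type is assumed to be an unordered pair of genotypes.

```python
from math import comb, perm


def ordered_heterozygotes(l):
    """Heterozygotes when parental origin matters: lP2 = l(l - 1)."""
    return perm(l, 2)


def unordered_heterozygotes(l):
    """Heterozygotes without regard to order: lC2 = l(l - 1)/2."""
    return comb(l, 2)


def combinations_with_repetition(n, r):
    """C'(n, r) = (n + r - 1) choose r."""
    return comb(n + r - 1, r)


def genotypes(l):
    """Unordered pairs of alleles, repeats allowed (homozygotes included)."""
    return combinations_with_repetition(l, 2)


def mating_types(l):
    """Unordered pairs of genotypes, repeats allowed (assumption: order of parents ignored)."""
    return combinations_with_repetition(genotypes(l), 2)


if __name__ == "__main__":
    for l in (2, 3, 4):
        print(l, ordered_heterozygotes(l), unordered_heterozygotes(l),
              genotypes(l), mating_types(l))
```

With l = 2 alleles this gives 3 genotypes and 6 mating types, which agrees with the six mating types listed for a biallelic locus.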
The purpose of segregation analysis is to detect significant segregation ratio distortion. A significant departure would suggest that one of our assumptions about the model is wrong.

Segregation Analysis: What it Teaches Us
Genetic model for a single-locus gene: dominant, codominant, truly single locus.
Other genetic information: selection-free, completely penetrant.
Data quality: systematic error, non-random sampling.
Few important genes are single-locus. Often single-locus analysis is used to verify marker systems.

Segregation Analysis: Experimental Design
Run a controlled cross with known expected segregation ratios.
OR
Sample offspring of a particular mating type with known expected segregation ratios.
Verify segregation ratios.

Autosomal Dominant
Mating type  Label  DD    Dd    dd    Dominant  Recessive
DDxDD        -      1     0     0     1         0
DDxDd        A      0.5   0.5   0     1         0
DDxdd        -      0     1     0     1         0
DdxDd        B      0.25  0.5   0.25  0.75      0.25
Ddxdd        C      0     0.5   0.5   0.5       0.5
ddxdd        -      0     0     1     0         1
(DD, Dd, dd columns are offspring genotype proportions; Dominant and Recessive columns are offspring phenotype proportions.)

Autosomal Dominant: The Data and Hypothesis
Obtain a random sample of matings between affected (Dd) and unaffected (dd) individuals.
Sample n of their offspring and find that r are affected with the disease (i.e. Dd).
H0: proportion of affected offspring is 0.5

Autosomal Dominant: Binomial Test
H0: p = 0.5
If $r \le n/2$, p-value = $2P(X \le r)$.
If $r > n/2$, p-value = $2P(X \le n-r)$.
$P(X \le c) = \sum_{x=0}^{c} \binom{n}{x} \left(\frac{1}{2}\right)^{n}$
Observe r = 29 (with n = 50): p-value = 0.32

Autosomal Dominant: Standard Normal Test
$\mu = np$, $\sigma^2 = np(1-p)$
$X \sim N(np, np(1-p))$ approximately, so $Z = \frac{X - np}{[np(1-p)]^{1/2}} \sim N(0, 1)$
Under H0, $X \sim N(n/2, n/4)$
$z = \frac{r - n/2}{(n/4)^{1/2}} = 1.13$
Observe r = 29: p-value = 0.26

Autosomal Dominant: Pearson Chi-Square Test
The distribution of the sum of k squares of iid standard normal variables is defined as a chi-square distribution with k degrees of freedom.
$Z^2 = \frac{(X - np)^2}{np(1-p)} \sim \chi^2_1$
$Z^2 = \frac{(X - np)^2}{np} + \frac{[(n - X) - n(1-p)]^2}{n(1-p)}$
$z^2 = \frac{(r - n/2)^2}{n/4} = 1.28$
p-value = 0.26

Continuity Correction
Both the normal and chi-square are continuous distributions, but our data are discrete.
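The three tests above can be reproduced for the observed r = 29 affected offspring out of n = 50. This is a minimal sketch using only the standard library; the function names are hypothetical, and the optional `correction` argument (0.5 for a continuity correction) is an illustrative convenience rather than part of the lecture's notation. It relies on the identities $2(1 - \Phi(z)) = \mathrm{erfc}(z/\sqrt{2})$ and $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$.

```python
from math import comb, erfc, sqrt


def binomial_two_sided(r, n):
    """Exact two-sided p-value under H0: p = 0.5, doubling the smaller tail."""
    c = min(r, n - r)
    tail = sum(comb(n, x) for x in range(c + 1)) / 2 ** n  # P(X <= c)
    return min(1.0, 2 * tail)


def normal_two_sided(r, n, correction=0.0):
    """z statistic and two-sided p-value from the normal approximation."""
    z = max(0.0, abs(r - n / 2) - correction) / sqrt(n / 4)
    return z, erfc(z / sqrt(2))


def pearson_chi_square(r, n, correction=0.0):
    """Pearson chi-square (1 df) for counts r and n - r against n/2 each."""
    e = n / 2
    x2 = sum(max(0.0, abs(o - e) - correction) ** 2 / e for o in (r, n - r))
    return x2, erfc(sqrt(x2 / 2))


if __name__ == "__main__":
    n, r = 50, 29
    print("binomial p:", round(binomial_two_sided(r, n), 2))
    print("normal z, p:", [round(v, 2) for v in normal_two_sided(r, n)])
    print("chi-square, p:", [round(v, 2) for v in pearson_chi_square(r, n)])
    print("corrected normal p:", round(normal_two_sided(r, n, 0.5)[1], 2))
```

Passing `correction=0.5` shifts each observed count half a unit toward its expectation, which is the same as replacing r = 29 by 28.5.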
Continuity correction for Normal: r = 28.5; corrected p-value = 0.32
Continuity correction for Chi-Square: r = 28.5; n-r = 21.5; corrected p-value = 0.32

Autosomal Dominant: Likelihood Ratio Test
Write likelihood: $L(p) = \binom{n}{r} p^{r} (1-p)^{n-r}$
Calculate the MLE under HA: $\hat{p} = \frac{r}{n}$
Calculate the G statistic: $G = 2(\log L_A - \log L_0) = 2\sum_{i=1}^{c} o_i \log\frac{o_i}{e_i} = 2\left[r \log\frac{r}{0.5n} + (n-r)\log\frac{n-r}{0.5n}\right]$
Determine G distribution: $G \sim \chi^2_1$
Calculate p-value = 0.26

Estimating Segregation Ratio: MOM
first moment = np
sample moment = r
MOM: np = r
MOM estimate: $\hat{p} = \frac{r}{n}$

Estimating Segregation Ratio: Likelihood Method
Set score to 0: $\frac{r}{\hat{p}} - \frac{n-r}{1-\hat{p}} = 0$
Solve for mle: $\hat{p} = \frac{r}{n}$

Estimating Confidence Interval for Segregation Ratio
Our estimate is X/n, where X is the random variable representing the number of “successes” observed and n is the sample size.
$E(X/n) = E(X)/n = np/n = p$
$Var(X/n) = Var(X)/n^2 = np(1-p)/n^2 = p(1-p)/n$
$SE(X/n) = [\hat{p}(1-\hat{p})/n]^{1/2}$
Therefore, X/n is unbiased and we can obtain a confidence interval using a normal approximation with SE(X/n).

Estimating Confidence Interval for Segregation Ratio
$\hat{p} = \frac{29}{50} = 0.58$
$SE(\hat{p}) = \left[\frac{29}{50}\cdot\frac{21}{50}\cdot\frac{1}{50}\right]^{1/2} = 0.0698$
$(\hat{p} - 1.96\,SE,\ \hat{p} + 1.96\,SE) = (0.443, 0.717)$

Segregation Analysis: Codominant Loci I
Mating type  DD    Dd    dd
DDxDD        1     0     0
DDxDd        0.5   0.5   0
DDxdd        0     1     0
DdxDd        0.25  0.5   0.25
Ddxdd        0     0.5   0.5
ddxdd        0     0     1
(Entries are offspring genotype proportions.)

Segregation Analysis: Codominant Loci II
All 6 mating types are identifiable.
Each mating type can be tested for agreement with expected segregation ratios.
Some mating types result in 3 types of offspring, so a chi-square or likelihood ratio test must be used.

Multiple Populations: Testing for Heterogeneity
Suppose you observe segregation ratios in samples from each of m populations.
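The likelihood ratio test and the confidence interval worked earlier for r = 29, n = 50 can be verified with a few lines of code. This is a sketch under the same assumptions as the slides (H0: p = 0.5, normal approximation for the interval); the function names are hypothetical, and the 1-df chi-square tail is computed as $\mathrm{erfc}(\sqrt{x/2})$.

```python
from math import erfc, log, sqrt


def g_statistic(observed, expected):
    """G = 2 * sum(o_i * log(o_i / e_i)); empty classes contribute zero."""
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)


def chi2_1df_pvalue(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(x / 2))


def wald_interval(r, n, z=1.96):
    """Estimate r/n, its standard error, and the normal-approximation interval."""
    p_hat = r / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se, (p_hat - z * se, p_hat + z * se)


if __name__ == "__main__":
    n, r = 50, 29
    g = g_statistic([r, n - r], [n / 2, n / 2])
    print("G:", round(g, 3), "p-value:", round(chi2_1df_pvalue(g), 2))
    p_hat, se, (lo, hi) = wald_interval(r, n)
    print("p_hat:", p_hat, "SE:", round(se, 4), "CI:", (round(lo, 3), round(hi, 3)))
```

The same `g_statistic` function applies to codominant crosses with three offspring classes, but the p-value helper here covers only the 1-df case.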
Calculate a total chi-square: $\chi^2_{total} = \sum_{i=1}^{m}\sum_{j=1}^{n} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
Calculate a pooled chi-square: $\chi^2_{pooled} = \sum_{j=1}^{n} \frac{\left(\sum_{i=1}^{m} o_{ij} - \sum_{i=1}^{m} e_{ij}\right)^2}{\sum_{i=1}^{m} e_{ij}}$
Here $o_{ij}$ and $e_{ij}$ are the observed and expected counts of class j in population i.

Multiple Populations: Testing for Heterogeneity
Then $\chi^2_{total} - \chi^2_{pooled} \sim \chi^2_{(n-1)(m-1)}$ when n counts the classes; this is $\chi^2_{n(m-1)}$ if n instead denotes the degrees of freedom of a single population's test.

Multiple Populations: Testing for Heterogeneity
Alternatively, one may calculate G statistics. Then $G_{total} - G_{pooled}$ follows the same chi-square distribution as the difference of the Pearson statistics.
$G_{total} = 2\sum_{i=1}^{m}\sum_{j=1}^{n} o_{ij} \log\frac{o_{ij}}{e_{ij}}$
$G_{pooled} = 2\sum_{j=1}^{n} \left(\sum_{i=1}^{m} o_{ij}\right) \log\frac{\sum_{i=1}^{m} o_{ij}}{\sum_{i=1}^{m} e_{ij}}$

Multiple Populations: Example
In Mendel’s F2 cross of smooth and wrinkled inbred pea lines, he sampled 10 plants and counted the number of smooth and wrinkled peas produced by each of those plants.
Is there heterogeneity between plants?
Further tests show that a single gene controls smooth vs. wrinkled, and that smooth is dominant to wrinkled.

Screening Markers for Polymorphism
An important step in designing mapping studies is to find markers that show polymorphism. We are interested in tests for polymorphism.
A false negative would result if the marker was truly polymorphic, but our test showed it to be monomorphic.
A false positive would result if the marker was truly monomorphic, but our test showed it to be polymorphic.

Testing for Polymorphism: Backcross 1:1
You design a backcross experiment to test for polymorphism at a marker of interest. You sample n offspring of the backcross.
$P(\text{monomorphic}) = 2(0.5)^{n}$, the probability that all n offspring fall in the same class even though the marker segregates.

Testing for Polymorphism: F2 codominant 1:2:1
You design an F2 cross with a marker that is codominant. You sample n F2 individuals.
$P(\text{monomorphic}) = 2(0.25)^{n} + (0.5)^{n}$

Testing for Polymorphism: F2 dominant marker
You design an F2 cross, but this time observe a dominant marker. You sample n F2 individuals.
$P(\text{monomorphic}) = (0.75)^{n} + (0.25)^{n}$

Power of Test for Polymorphism
[Figure: Power to Detect Polymorphism. Power (vertical axis) against sample size from 1 to 19 (horizontal axis) for the 1:1, 1:2:1, and 3:1 segregation ratios.]

Estimating Heterozygosity
$H = 1 - \sum_{i=1}^{l} p_i^2$
$\hat{H} = \frac{n}{n-1}\left(1 - \sum_{i=1}^{l} \hat{p}_i^2\right)$
[Equation: $Var(\hat{H})$, not recoverable from the extracted text; it is written in terms of n, n - 1, $\sum_{i=1}^{l} p_i^3$, and $\left(\sum_{i=1}^{l} p_i^2\right)^2$.]

Estimating Allele Frequency
It is often assumed that alleles have equal frequencies when there are many alleles at a locus. This assumption can result in false positives for linkage, so it is important to test allele frequencies.
Suppose there are l possible alleles A1, A2, ….
You observe $n_{ij}$ genotypes $A_iA_j$.
You estimate genotype frequencies $\hat{p}_{ij}$.

Estimating Allele Frequencies
$\hat{p}_i = \hat{p}_{ii} + \frac{1}{2}\sum_{j \ne i} \hat{p}_{ij}$
$Var(\hat{p}_i) = \frac{p_i(1-p_i) - p_i^2 + p_{ii}}{2n} = \frac{p_i(1-p_i)}{2n}$ under HWE
$Cov(\hat{p}_i, \hat{p}_j) = \frac{1}{4n}\left(p_{ij} - 4p_ip_j\right) = -\frac{1}{2n}p_ip_j$ under HWE

Probability of Observing an Allele
Suppose there is an allele $A_i$ with frequency $p_i$. What is the probability of sampling at least one allele of type $A_i$ among n individuals?
$P(\text{observing at least one allele } A_i) = 1 - (1 - p_i)^{2n}$
Sample size calculation, for a target detection probability $\theta_i$: $n \ge \frac{\log(1 - \theta_i)}{2\log(1 - p_i)}$

Probability of Observing Multiple Alleles
Let $\theta_i$ be the probability of observing at least one allele of type i.
There are $\binom{l}{m}$ ways of selecting m different alleles; each selection $j_m$ has an associated probability $\theta(j_m)$ of detecting at least one of each, calculated from the $\theta_i$.
Then we can calculate the probability of observing k or more alleles by summing over these probabilities for k, k+1, …, l.

Approximate Probability of Observing k or More Alleles
The above procedure becomes computationally difficult when there are many alleles and the frequencies are unequal. There is a Monte Carlo approximation.
Select a random variable $I_i$ to be 1 with probability $\theta_i$ and 0 otherwise.
Compute $I = \sum_{i=1}^{l} I_i$ for b bootstrap trials.
The proportion of trials with $I \ge k$ is an estimate of the probability of observing k or more alleles.

Summary
Permutations and combinations: knowing how to count the number of genotypes, mating types, etc.
Testing segregation ratios for dominant and codominant loci.
Testing for population heterogeneity.
Screening for polymorphism.
Estimating heterozygosity and the probability of observing an allele.
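The allele detection probability, the sample size calculation, and the Monte Carlo approximation for observing k or more alleles, as described earlier in the lecture, can be sketched as follows. The function names, the default number of trials, and the fixed seed are illustrative assumptions; as in the lecture's description, the indicators are drawn independently, which is an approximation.

```python
import random
from math import ceil, log


def detection_probability(p_i, n):
    """P(at least one copy of an allele with frequency p_i among 2n sampled alleles)."""
    return 1 - (1 - p_i) ** (2 * n)


def min_sample_size(p_i, theta):
    """Smallest n with detection probability >= theta, for 0 < p_i < 1 and 0 < theta < 1."""
    return ceil(log(1 - theta) / (2 * log(1 - p_i)))


def prob_k_or_more(freqs, n, k, trials=10000, seed=0):
    """Monte Carlo estimate of P(observing k or more distinct alleles)."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    thetas = [detection_probability(p, n) for p in freqs]
    hits = 0
    for _ in range(trials):
        # I_i = 1 with probability theta_i; I is the number of alleles observed
        observed = sum(rng.random() < t for t in thetas)
        if observed >= k:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    freqs = [0.5, 0.3, 0.15, 0.05]  # hypothetical allele frequencies
    print(min_sample_size(0.05, 0.95))
    print(prob_k_or_more(freqs, n=10, k=4))
```

Increasing `trials` reduces the Monte Carlo error of the estimate at the cost of run time.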