Training course in Quantitative Genetics and Genomics
Biosciences East and Central Africa-International Livestock Research Institute (BecA-ILRI) Hub, Nairobi, KENYA
May 30-June 10, 2016

POPULATION AND QUANTITATIVE GENETICS
GENOME ORGANIZATION AND GENETIC MARKERS
SELECTION THEORY
BREEDING STRATEGIES

Samuel E Aggrey, PhD
Professor, Department of Poultry Science, Institute of Bioinformatics
University of Georgia, Athens, GA 30602, USA

Preface

These lecture notes were written in an attempt to cover parts of Population Genetics, Quantitative Genetics and Molecular Genetics for postgraduate students, and also as a refresher for field geneticists. The course material is not a textbook and is not meant to be copied, duplicated or sold. This text is unedited and I am solely responsible for all conceptual mistakes, grammatical errors and typos. Genetics is a life-long course and cannot be covered in a few lectures. Because of time constraints, only selected parts of population, quantitative and molecular genetics will be covered. The course will cover some of the evolutionary forces that change allele frequency between generations, such as natural selection and gene flow, and some aspects of Quantitative and Molecular Genetics.

To those men who have kept us awake for over two centuries and, I believe, will continue to do so for many more centuries!

POPULATION GENETICS

Population genetics is the study of the genetic composition of biological populations, and of the changes in genetic composition that result from the operation of various factors including (a) natural selection, (b) genetic drift, (c) mutation and (d) gene flow.

Genetic composition
1. The number of alleles at a locus
2. The frequency of alleles at a locus
3. The frequency of genotypes at a locus
4. Transmission of alleles from one generation to the next

Population
A group of breeding individuals.

Single locus: consider locus A with two alleles, A1 and A2. Let P, H and Q be the frequencies of the genotypes A1A1, A1A2 and A2A2, respectively. The allele frequencies are then:
p = P + ½H
q = Q + ½H

Derivation of the Hardy-Weinberg principle

Ideal population
1. There are two sexes and the population consists of sexually mature individuals.
2. All matings between males and females are equally probable (independent of the distance between mates, their genotypes and their ages).
3. The population is large, and the actual frequency of each mating is equal to its Mendelian expectation.
4. Meiosis is fair. We assume that there is no segregation distortion, no gamete competition, and no difference in the developmental ability of eggs or the fertilizing ability of sperm.
5. All matings produce the same number of offspring, on average. Thus, the frequency of a particular genotype in the pool of newly formed zygotes is:
   ∑(frequency of mating) × (frequency of genotype produced from mating)
   Frequency (A1A1 in zygotes) = P² + ½PH + ½PH + ¼H² = (P + ½H)² = p²
   Frequency (A1A2) = 2pq
   Frequency (A2A2) = q²
6. Generations do not overlap.
7. There is no difference among genotype groups in the probability of survival.
8. There is no migration, mutation, drift or selection.

Hardy-Weinberg Law
In a large random mating population, in the absence of mutation, migration, selection and random drift, allele frequency remains the same from generation to generation. Furthermore, there is a simple relationship between allele frequency and genotypic frequency.

Why is the Hardy-Weinberg principle so important? Is there any population anywhere in the world or outer space that satisfies all of its assumptions? Evolutionary forces acting within populations violate at least one of these assumptions, and departures from Hardy-Weinberg proportions are one way in which we detect those forces and estimate their magnitude. The most significant evolutionary factors are selection (natural or artificial), non-random mating and gene flow.

Fig. 1 shows the relationship between allele frequency and the three genotypic frequencies for a population in Hardy-Weinberg proportions:
1. The heterozygote is the most common genotype for intermediate allele frequencies.
2. One of the homozygotes is the most common genotype when the allele frequency is not intermediate.
3. The heterozygote is the most common genotype only when q is between ⅓ and ⅔, that is, over only ⅓ of the range of q.
4. When q is between 0 and ⅓, A1A1 is the most common genotype, and when q is between ⅔ and 1, A2A2 is the most common.
5. The maximum frequency of the heterozygote occurs when q = 0.5.

Point 5 can be shown directly by setting the derivative of the H-W heterozygosity, 2pq = 2q(1 − q), equal to zero and solving for q:

$$\frac{d[2q(1-q)]}{dq} = 2 - 4q = 0, \quad \text{so } q = 0.5$$

Here, we assume that the generations are non-overlapping, i.e. the parents die after producing progeny, and the progeny then become the next parental generation.

Testing for deviation from Hardy-Weinberg Equilibrium

Departure from Hardy-Weinberg equilibrium can be tested using a sample of individuals scored for their genotypes. The genetic model provided by Hardy-Weinberg generates the expected genotypic frequencies at equilibrium, so we can compare the observed genotype numbers with those expected under Hardy-Weinberg proportions. The chi-square test of goodness of fit and the likelihood ratio test can both be used to test for departure, or lack thereof, from Hardy-Weinberg equilibrium. The chi-square test is an approximation to the likelihood ratio test. To perform a chi-square goodness of fit test, we first estimate the allele frequencies from the observed genotype numbers, then use them to generate the expected genotype numbers. The chi-square statistic is computed as:

$$X^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

where O and E are the observed and expected numbers of a particular genotype and n is the number of genotypic classes. Comparing the calculated value of X² with the table value of X² gives the probability that a deviation this large between observed and expected numbers would arise by chance. The degrees of freedom used to determine the significance of the X² value are equal to the number of genotypic classes, n, minus one, then minus the number of parameters estimated from the data.
One degree of freedom is always lost because we use the data to estimate the allele frequency; with two alleles (three genotypic classes) this leaves 3 − 1 − 1 = 1 degree of freedom. We can use the chi-square distribution to test whether the value of X² is too large to be the result of sampling error. In doing so we are performing a one-tailed test. The chi-square expression for two alleles is:

$$X^2 = \frac{(N_{11} - \hat{p}^2 N)^2}{\hat{p}^2 N} + \frac{(N_{12} - 2\hat{p}\hat{q} N)^2}{2\hat{p}\hat{q} N} + \frac{(N_{22} - \hat{q}^2 N)^2}{\hat{q}^2 N}$$

An alternative way to measure the difference between observed and expected frequencies is to calculate the standardized deviation of the observed heterozygote frequency from its Hardy-Weinberg expectation. This gives the fixation index, or more generally the inbreeding coefficient, F:

$$F = \frac{2pq - H}{2pq} = 1 - \frac{H}{2pq}$$

It can be shown that

$$X^2 = F^2 N$$

For two alleles, the chi-square goodness of fit test for Hardy-Weinberg proportions is therefore equivalent to a test of F = 0. However, F is unstable as the expected value (E) approaches zero, and is therefore not useful for rare and very common alleles. For E = 0 and O > 0, F = −∞, and for E = 0 and O = 0, F is undefined. Deviation from Hardy-Weinberg proportions can also be tested using the likelihood ratio test, which is described in most statistical texts.

Example: The B/b locus is responsible for plumage color in chickens found in the Rift Valley. The B allele expresses black plumage, which is completely dominant over the b allele for brown plumage.

Phenotype   Genotype   Observed number   Expected number
Black       BB         290               p̂²N = 289.444
Black       Bb         496               2p̂q̂N = 497.112
Brown       bb         214               q̂²N = 213.444
Total                  1,000             1,000

P = 290/1000 = 0.29; H = 496/1000 = 0.496; Q = 214/1000 = 0.214; P + H + Q = 1.0
p̂ = P + ½H = 0.29 + ½(0.496) = 0.538; q̂ = Q + ½H = 0.214 + ½(0.496) = 0.462; p̂ + q̂ = 1.0

Note: Chi-square is allergic to fractions and ratios, but really likes integers! Always compute it from numbers of individuals, not from frequencies.

$$X^2 = \frac{(290 - 289.444)^2}{289.444} + \frac{(496 - 497.112)^2}{497.112} + \frac{(214 - 213.444)^2}{213.444} = 0.0050$$

The table value of X² at p = 0.05 and 1 degree of freedom is 3.84. Since the calculated X² is lower than the table value, we conclude that the data do not deviate from Hardy-Weinberg proportions.
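The plumage example above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original notes: the function name `hardy_weinberg_chi_square` and the constant `CRITICAL_1DF_005` are names chosen here for illustration, and the critical value 3.84 is taken from the text rather than computed.

```python
def hardy_weinberg_chi_square(n11, n12, n22):
    """Chi-square goodness-of-fit statistic for Hardy-Weinberg proportions.

    n11, n12, n22 are the observed NUMBERS (not frequencies) of the
    A1A1, A1A2 and A2A2 genotypes.
    """
    n = n11 + n12 + n22
    # Allele frequency estimated by gene counting: p = P + H/2
    p_hat = (2 * n11 + n12) / (2 * n)
    q_hat = 1.0 - p_hat
    observed = (n11, n12, n22)
    expected = (p_hat ** 2 * n, 2 * p_hat * q_hat * n, q_hat ** 2 * n)
    chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Fixation index: F = 1 - H / (2pq), with H the observed heterozygosity
    f = 1.0 - (n12 / n) / (2 * p_hat * q_hat)
    return p_hat, q_hat, expected, chi_sq, f


CRITICAL_1DF_005 = 3.84  # table value quoted in the text (p = 0.05, 1 df)

# Plumage data from the example: BB = 290, Bb = 496, bb = 214
p_hat, q_hat, expected, chi_sq, f = hardy_weinberg_chi_square(290, 496, 214)
deviates = chi_sq > CRITICAL_1DF_005
print(f"p = {p_hat:.3f}, q = {q_hat:.3f}, X2 = {chi_sq:.4f}, F = {f:.6f}")
print("Deviates from Hardy-Weinberg proportions:", deviates)
```

The same call also returns F, so the identity X² = F²N can be checked numerically with `f ** 2 * 1000`.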
For the same data, the fixation index is:

$$F = 1 - \frac{H}{2pq} = 1 - \frac{0.496000}{0.497112} = 0.002237$$

$$X^2 = F^2 N = 0.0050$$

Extension of Hardy-Weinberg's Law: Multiple Alleles

Let us consider a single locus with three alleles A1, A2 and A3 with frequencies p, q and r, respectively.

Hardy-Weinberg frequencies for three autosomal alleles at a single locus

Allele (frequency)   A1 (p)       A2 (q)       A3 (r)
A1 (p)               A1A1, p²     A1A2, pq     A1A3, pr
A2 (q)               A2A1, qp     A2A2, q²     A2A3, qr
A3 (r)               A3A1, rp     A3A2, rq     A3A3, r²

Genotype   Frequency       Number
A1A1       p²              N11
A1A2       pq + pq = 2pq   N12
A1A3       pr + pr = 2pr   N13
A2A2       q²              N22
A2A3       qr + qr = 2qr   N23
A3A3       r²              N33
TOTAL      1.0             N

Please note that p + q + r = 1. The key to solving multiple-allele problems is to break the problem down so that it resembles a two-allele problem. Start with the A3A3 homozygote:

$$f(A_3A_3) = r^2 = \frac{N_{33}}{N}, \qquad r = \sqrt{\frac{N_{33}}{N}}$$

From here, let us reduce the problem to a two-allele locus involving the alleles A2 and A3. The expected genotypes under H-W are A2A2, A2A3 and A3A3, with expected frequency:

$$q^2 + 2qr + r^2 = \frac{N_{22} + N_{23} + N_{33}}{N}$$

From basic algebra, (a + b)² = a² + 2ab + b². This implies (q + r)² = q² + 2qr + r². Therefore:

$$(q + r)^2 = \frac{N_{22} + N_{23} + N_{33}}{N}, \qquad q + r = \sqrt{\frac{N_{22} + N_{23} + N_{33}}{N}}$$

$$q = \sqrt{\frac{N_{22} + N_{23} + N_{33}}{N}} - \sqrt{\frac{N_{33}}{N}}$$

Since p + q + r = 1, then p = 1 − (q + r):

$$p = 1 - \sqrt{\frac{N_{22} + N_{23} + N_{33}}{N}}$$

Example: The ABO blood group in humans is determined by three alleles, A, B and O.

Allele (frequency)   A (p)      B (q)      O (r)
A (p)                AA, p²     AB, pq     AO, pr
B (q)                AB, pq     BB, q²     BO, qr
O (r)                AO, pr     BO, qr     OO, r²

Genotype   Frequency       Number
AA         p²              N11
AB         pq + pq = 2pq   N12
AO         pr + pr = 2pr   N13
BB         q²              N22
BO         qr + qr = 2qr   N23
OO         r²              N33

In the year 1825, the director general of ILRI-Musastan ordered a staff nurse to collect blood samples from all capacity building course participants. Of the 1,825 individuals sampled, 700 were type A, 250 were type B, 75 were type AB and 800 were type O. Determine the frequency of the A, B and O alleles.
Hint:

Phenotype   Genotype   H-W expectation   Number
A           AA + AO    p² + 2pr          700
B           BB + BO    q² + 2qr          250
AB          AB         2pq               75
O           OO         r²                800

Natural Selection at One Locus

Differential viability and fertility

Natural selection occurs when genotypes in a population differ in survival, fertility or reproduction. To model it, we multiply each genotype's frequency by its fitness, where fitness reflects the genotype's probability of survival and its relative participation in reproduction. Consider a single autosomal locus with two alleles, A1 and A2, and three diploid genotypes A1A1, A1A2 and A2A2 with fitnesses denoted w11, w12 and w22, respectively. Unless w11, w12 and w22 are all equal, natural selection will occur, possibly causing the genetic composition of the population to change. Before the operation of natural selection (generation 0), the genotypes are in Hardy-Weinberg equilibrium and the frequencies of the A1 and A2 alleles are p0 and q0, respectively (p0 + q0 = 1). The genotypes of generation 0 produce the progeny that become generation 1, with frequencies of A1 and A2 denoted by p1 and q1, respectively (p1 + q1 = 1). In both generations, the allele frequency is considered at the zygote stage, and it may differ from the adult allele frequency if there is differential viability. Assuming that there is no mutation and that Mendel's law of segregation holds, an A1A1 genotype will produce only A1 gametes, an A2A2 genotype will produce only A2 gametes, and an A1A2 genotype will produce A1 and A2 gametes in equal proportion. Therefore, the proportion of A2 gametes, and thus the frequency of the A2 allele in generation 1 at the zygotic stage, is:

$$q_1 = \frac{q_0^2 w_{22} + \tfrac{1}{2}(2 p_0 q_0 w_{12})}{\bar{w}} = \frac{q_0^2 w_{22} + p_0 q_0 w_{12}}{\bar{w}} \qquad [1]$$

where w̄ is the average fitness of the population. Equation [1] is known as a 'recurrence' equation, as it expresses the frequency of the A2 allele in generation 1 in terms of its frequency in generation 0.
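Recurrence Equation [1] is easy to iterate numerically. The sketch below is illustrative only: the function names are chosen here, the mean fitness is taken to be the usual weighted average w̄ = p²w11 + 2pqw12 + q²w22, and the fitness values in the usage lines are assumed values, not data from the course.

```python
def next_allele_frequency(q0, w11, w12, w22):
    """One generation of viability selection, following Equation [1].

    q0 is the frequency of A2 among zygotes in the parental generation;
    the return value is its frequency among zygotes of the next generation.
    """
    p0 = 1.0 - q0
    # Mean fitness of the population (assumed: weighted average of fitnesses)
    w_bar = p0 ** 2 * w11 + 2 * p0 * q0 * w12 + q0 ** 2 * w22
    return (q0 ** 2 * w22 + p0 * q0 * w12) / w_bar


def allele_trajectory(q0, w11, w12, w22, generations):
    """Frequencies of A2 for generation 0, 1, ..., generations."""
    trajectory = [q0]
    for _ in range(generations):
        trajectory.append(next_allele_frequency(trajectory[-1], w11, w12, w22))
    return trajectory


# Illustrative (assumed) fitnesses: A2A2 has reduced fitness
q1 = next_allele_frequency(0.5, w11=1.0, w12=1.0, w22=0.4)
path = allele_trajectory(0.5, 1.0, 1.0, 0.4, generations=20)
print(f"q1 = {q1:.4f}, q20 = {path[-1]:.4f}")
```

With equal fitnesses the function returns q0 unchanged, which is the Hardy-Weinberg result.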
The change in frequency between generations can then be written as:

$$\Delta q = q_1 - q_0 = \frac{q_0^2 w_{22} + p_0 q_0 w_{12}}{\bar{w}} - q_0 = \frac{q_0^2 w_{22} + p_0 q_0 w_{12} - q_0 \bar{w}}{\bar{w}}$$

If we substitute w̄ = p²w11 + 2pqw12 + q²w22 (Table 3), use q = 1 − p and drop the generation subscripts, the equation above simplifies to:

$$\Delta q = \frac{pq w_{12} + q^2 w_{22} - q(p^2 w_{11} + 2pq w_{12} + q^2 w_{22})}{\bar{w}}$$

$$= \frac{q(pq w_{22} - pq w_{12} - p^2 w_{11} + p^2 w_{12})}{\bar{w}}$$

$$= \frac{pq[q(w_{22} - w_{12}) - p(w_{11} - w_{12})]}{\bar{w}} \qquad [2]$$

Equations [1] and [2] show, in precise terms, how fitness differences between genotypes lead to evolutionary change. If Δq = 0, then no allele frequency change has occurred and the population is in allelic equilibrium. It is worth mentioning that Δq = 0 does not mean that no natural selection has occurred. The condition for no selection is w11 = w12 = w22. It is possible for natural selection to occur and have no effect on allele frequency.

Directional selection

If Δq > 0, then natural selection has led the A2 allele to increase in frequency; if Δq < 0, then natural selection has led the A1 allele to increase in frequency. If w11 > w12 > w22, then the A1A1 genotype is fitter than A1A2, which in turn is fitter than A2A2, in which case Δq must be negative (as long as neither p nor q is 0). In each generation, the frequency of the A1 allele will be greater than in the previous generation, until it eventually reaches fixation and the A2 allele is eliminated from the population. Once A1 reaches fixation (p = 1 and q = 0), no further evolutionary change will occur. In this case, the A1 allele confers a fitness advantage on the genotypes that carry it, and its relative frequency in the population increases from generation to generation until it is fixed. The opposite outcome (fixation of A2) occurs when w22 > w12 > w11. Table 4 gives a numerical example of directional natural selection. Fig. 2 illustrates allele frequency under Hardy-Weinberg proportions where there is no differential viability, w11 = w12 = w22 = 1.0, and the average fitness w̄ = 1.0 from generation to generation.
Assuming w22 = 0.4 as in Table 4, the allele frequency of A1 increases and that of A2 decreases non-linearly until A1 reaches fixation, as illustrated in Fig 3. Ultimately, the population will be monomorphic for the homozygous genotype with the highest fitness.

Stabilizing selection

An interesting situation arises when the heterozygote is superior in fitness to the two homozygotes. In this case, w11 < w12 > w22, and an equilibrium is reached with both alleles present in the population. Setting the bracketed term in Equation [2] to zero gives the equilibrium frequency:

$$\hat{q} = \frac{w_{11} - w_{12}}{w_{11} - 2w_{12} + w_{22}}$$

Since q must be nonnegative (and no greater than 1), this condition can be satisfied only if there is heterozygote superiority (a condition also known as heterosis) or heterozygote inferiority. With heterozygote superiority, natural selection produces heterogeneity and preserves genetic variation. Unlike directional selection, stabilizing or balancing selection tends to keep both alleles in the population, and the allele frequencies converge to a polymorphic equilibrium (Fig 4).

Disruptive selection

Under disruptive selection (w11 > w12 < w22), the heterozygote has a lower relative fitness than the two homozygotes. Viability selection may lead either to an increase or to a decrease in the frequency of the A1 allele, depending on the starting frequency. In the long run, the population converges to fixation and becomes monomorphic for one of the homozygous genotypes (Fig 5).

Coefficient of selection

The speed with which allele or genotype frequency changes is driven by the relative fitness of each allele or genotype. Fitness (w11, w12 and w22) is a relative value, usually measured in comparison with the most-fit allele/genotype in the population. The selection coefficient, s, measures the reduction in fitness of a selected allele or genotype compared to the most-fit allele/genotype in the population. Selection against an allele may operate through reduced viability, reduced fertility, reduced mating ability, or any combination of the three.
Therefore, the change in allele frequency needs to be followed from the zygote stage of the parent generation to the zygote stage of the progeny generation. The coefficient of selection measures the proportionate reduction in the gametic contribution of a genotype compared to the most-fit genotype. The contribution of the most-fit genotype is taken to be 1, and the contribution of the genotype selected against is 1 − s. If the selection coefficient for a genotype is 0.60, its fitness is 0.40, which means that for every 100 zygotes produced by the most-fit genotype, only 40 are produced by the genotype selected against.

Dominance

To explore the effects of dominance, we can specify the fitnesses using two parameters: s, representing the difference in fitness between the two homozygotes, and h, representing the degree of dominance (which sets the fitness of the heterozygote). Let:
w11 = 1
w12 = 1 − hs
w22 = 1 − s

The parameter h together with s determines the fitness of the heterozygote.
a. If h = 0, the heterozygote has fitness 1, the same as the A1A1 homozygote: the A1 allele is completely dominant.
b. Conversely, if h = 1, the fitness of the heterozygote is the same as that of the A2A2 homozygote (1 − s): the A2 allele is completely dominant.
c. If 0 < h < 1, the heterozygote's fitness is somewhere between those of the homozygotes: there is incomplete dominance.
d. If h = ½ exactly, the alleles have additive effects: the heterozygote fitness is the average of the two homozygotes' fitnesses.
e. If h < 0, the heterozygote's fitness is greater than 1, and thus greater than that of the A1A1 homozygote; this is called overdominance.
f. Similarly, if h > 1, the heterozygote has lower fitness than the A2A2 homozygote (and of course also the A1A1 homozygote); this is underdominance.
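Cases a to f above can be summarized in a small helper. This is a sketch under the assumption that s > 0 (A2A2 is the genotype selected against); the function names and the returned labels are chosen here for illustration.

```python
def genotype_fitnesses(s, h):
    """Return (w11, w12, w22) for selection coefficient s and dominance h."""
    return 1.0, 1.0 - h * s, 1.0 - s


def dominance_class(h):
    """Classify the dominance relationship from h (assumes s > 0)."""
    if h < 0:
        return "overdominance"
    if h == 0:
        return "A1 completely dominant"
    if h == 0.5:
        return "additive (incomplete dominance)"
    if h < 1:
        return "incomplete dominance"
    if h == 1:
        return "A2 completely dominant"
    return "underdominance"


# Example: s = 0.6 and additive gene action (h = 0.5)
w11, w12, w22 = genotype_fitnesses(s=0.6, h=0.5)
print(w11, round(w12, 3), round(w22, 3), dominance_class(0.5))
```

In the additive case the heterozygote fitness is the mean of the two homozygote fitnesses, which the example values confirm.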
Table 5. Fitness values for different fitness relationships

Case                        A1A1    A1A2    A2A2    Description
General fitness             w11     w12     w22
Recessive lethal            1       1       0       No dominance, selection against A2A2
Detrimental allele          1       1       1-s     No dominance, selection against A2
Dominance                   1       1-hs    1-s     Partial dominance of A1, selection against A2
Dominance                   1       1       1-s     Complete dominance of A1, selection against A2
Dominance                   1-s     1-s     1       Complete dominance of A1, selection against A1
Heterozygote advantage      1-s1    1       1-s2    Overdominance, selection against A1A1 & A2A2
Heterozygote disadvantage   1+s1    1       1+s2    Underdominance, selection against A1A2

Lethal alleles

Lethal alleles are alleles that cause the death of the organism carrying them, in most cases only when present in the homozygous state. If the mutation is caused by a dominant lethal allele, the heterozygote for the allele will show the lethal phenotype, and the dominant homozygote cannot occur. If the mutation is caused by a recessive lethal allele, only the homozygote for the allele will have the lethal phenotype. Most lethal genes are recessive. Many lethal alleles prevent cell division and kill the organism at an early age. Some lethal alleles exert their effect later in life, e.g. Huntington disease, which is characterized by progressive degeneration of the nervous system, dementia and early death, with onset between 30 and 50 years.

Dominant lethal alleles: These modify the Mendelian 3:1 ratio to 2:1. If the organism dies before it can produce progeny, the mutant dominant allele is removed from the population in the same generation in which it arose. Fully dominant lethal alleles kill the carrier in both the homozygous and heterozygous states. Huntington's disease and creeper legs (short and stunted) in chickens are examples of dominant lethals where the homozygote does not survive.

Recessive lethal alleles: A recessive lethal kills the carrier individual only in the homozygous state. Recessive lethals may be of two kinds: (1) those that have no obvious phenotypic effect in heterozygotes, and (2) those that exhibit a distinctive phenotype in the heterozygous state.
In many cases, lethal alleles become operative at the onset of sexual maturity. Examples of recessive lethals in cattle are osteopetrosis (Angus and Red Angus) and pulmonary hypoplasia and anasarca (PHA) (Shorthorn). In humans, common examples are cystic fibrosis (poorly functioning Cl ion transport proteins in the lungs), Tay-Sachs disease (an enzyme unable to break down specific membrane lipids), sickle cell anemia and brachydactyly. The relative fitnesses for a recessive lethal are presented in Table 5.

                       A1A1    A1A2    A2A2    Total
Initial frequency      p²      2pq     q²      1
Fitness                1       1       0
Gametic contribution   p²      2pq     0       w̄ = p² + 2pq = p(1 + q)

From Equation [1]:

$$q_1 = \frac{q^2 w_{22} + pq w_{12}}{\bar{w}}$$

The average fitness under a recessive lethal is w̄ = p² + 2pq = 1 − q² = p(1 + q). Therefore, with w22 = 0 and w12 = 1:

$$q_1 = \frac{pq}{p(1 + q)} = \frac{q}{1 + q} \qquad [3]$$

$$\Delta q = q_1 - q_0 = \frac{q_0}{1 + q_0} - q_0 = -\frac{q_0^2}{1 + q_0}$$

The mean fitness reaches 1 when the population is fixed for A1. Equation [3] is a recursive relationship: the allele frequency at any time t + 1 is a function of the frequency at time t,

$$q_{t+1} = \frac{q_t}{1 + q_t}, \qquad \text{e.g. } q_2 = \frac{q_1}{1 + q_1}$$

When we substitute the value of q1 from Equation [3] into this expression, it becomes:

$$q_2 = \frac{q_0}{1 + 2q_0}$$

This relationship can be generalized to give the frequency in generation t as a function of the frequency in generation 0:

$$q_t = \frac{q_0}{1 + t q_0}$$

Since there are no surviving recessive homozygotes, the maximum possible allele frequency is 0.5, which occurs when all individuals are heterozygotes. Fig 6 shows the expected decline in frequency of a recessive lethal allele from two starting frequencies. When the allele frequency is high, it is reduced very quickly. High-throughput data have been used to delineate lethal haplotypes. In theory, this allows us to identify carrier animals and avoid mating them, which would eliminate recessive lethal alleles faster than natural selection alone.
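The decline of a recessive lethal can be checked numerically by comparing the generation-by-generation recursion with the closed-form expression for q_t. The sketch below uses function names chosen here for illustration and an assumed starting frequency of 0.5 (all individuals heterozygous).

```python
def recessive_lethal_closed_form(q0, t):
    """Frequency of a recessive lethal after t generations: q0 / (1 + t*q0)."""
    return q0 / (1.0 + t * q0)


def recessive_lethal_iterated(q0, t):
    """Same quantity obtained by applying q -> q / (1 + q) t times."""
    q = q0
    for _ in range(t):
        q = q / (1.0 + q)
    return q


# Assumed starting point: every individual is a heterozygous carrier
for generation in (0, 1, 2, 10, 100):
    q_t = recessive_lethal_closed_form(0.5, generation)
    print(f"generation {generation:3d}: q = {q_t:.4f}")
```

The output shows the pattern described in the text: the frequency halves in the first two generations (0.5 to 0.25) but is still close to 0.01 after 100 generations, because most remaining copies are hidden in heterozygotes.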
Selection against recessives

                       A1A1    A1A2    A2A2        Total
Initial frequency      p²      2pq     q²          1
Fitness                1       1       1 - s
Gametic contribution   p²      2pq     q²(1 - s)   w̄ = 1 - sq²

From Equation [1]:

$$q_1 = \frac{q^2 w_{22} + pq w_{12}}{\bar{w}}$$

When selecting against recessives, w12 = 1, w22 = 1 − s, and w̄ = 1 − sq². Therefore, q1 can be written as:

$$q_1 = \frac{q^2(1 - s) + pq}{1 - sq^2} = \frac{q(1 - sq)}{1 - sq^2}$$

The change in frequency of A2 is therefore:

$$\Delta q = -\frac{sq^2(1 - q)}{1 - sq^2}$$

Both the average fitness and the change in allele frequency are functions of the allele frequency and the selection coefficient. Selection against recessive alleles is very efficient at first, but becomes progressively slower because, as the allele frequency decreases, a sizeable proportion of the recessive alleles is carried by heterozygotes. Therefore, natural selection alone cannot entirely eliminate a recessive allele, even if it is lethal.

More than one locus: Linkage and linkage disequilibrium

Under random mating, alleles at each autosomal locus combine at random to form genotypes and attain equilibrium under the Hardy-Weinberg law. The basic assumption here is that the transmission of alleles at a given locus across generations is independent of the alleles at another locus. We also assume that the fitness of genotypes at one locus is not affected by genotypes at another locus. For several loci, these assumptions are likely to be violated.

Let us consider an A locus with two alleles, A1 and A2, at frequencies pA and qA, and a B locus, also with two alleles, B1 and B2, at frequencies pB and qB, respectively. Here pA + qA = 1 and pB + qB = 1, and under Hardy-Weinberg proportions the expected genotypic frequencies are pA² + 2pAqA + qA² and pB² + 2pBqB + qB², respectively. Alleles at the A locus may combine at random or in a non-random way with alleles at the B locus.
Random association of alleles, showing expected gametic frequencies under equilibrium

Allele (frequency)   A1 (pA)        A2 (qA)
B1 (pB)              A1B1, pA pB    A2B1, qA pB
B2 (qB)              A1B2, pA qB    A2B2, qA qB

Let us use some classical notation to represent the actual gametic frequencies. Let r, s, t and u represent the actual or observed gametic frequencies of A1B1, A1B2, A2B1 and A2B2, respectively, with r + s + t + u = 1. Under random association of gametes, each gametic frequency equals the product of its allele frequencies (r = pApB, s = pAqB, t = qApB and u = qAqB), so that ru = st. The state of random gametic association between alleles of different genes is called LINKAGE EQUILIBRIUM. If two loci are in linkage equilibrium, they are inherited completely independently in each generation. An example would be loci that are on two different chromosomes and encode unrelated, non-interacting proteins. Under random mating and the other assumptions of Hardy-Weinberg equilibrium, linkage equilibrium between loci is attainable. However, unlike the single-locus case, the rate at which gametic or linkage equilibrium is attained depends on the rate of recombination in genotypes heterozygous at both loci. There are two types of double heterozygotes:

A1B1/A2B2   coupling heterozygote
A1B2/A2B1   repulsive heterozygote

Gamete   Expected frequency   Observed frequency   Type
A1B1     pA pB                r                    Coupling
A1B2     pA qB                s                    Repulsive
A2B1     qA pB                t                    Repulsive
A2B2     qA qB                u                    Coupling

The observed gametic frequency differs from the expected gametic frequency by an amount D. We measure the non-randomness of the gametic frequencies by means of this deviation from two-locus equilibrium; D is the gametic disequilibrium coefficient. Gametic disequilibrium is often referred to as linkage disequilibrium. This may be confusing, because genes or loci need not be linked to be in gametic disequilibrium. The effect of D on gametic frequencies is similar to the effect of inbreeding on genotypic frequencies at a single locus. The heterozygote-deficit interpretation of the inbreeding coefficient, F, has been called a "one-locus disequilibrium" coefficient.
$$r = p_A p_B + D, \quad s = p_A q_B - D, \quad t = q_A p_B - D, \quad u = q_A q_B + D$$

The most common expression of D is:

$$D = ru - st$$

D is therefore the difference between the product of the coupling gametic frequencies and the product of the repulsive gametic frequencies. Substituting the expressions above:

$$D = (p_A p_B + D)(q_A q_B + D) - (p_A q_B - D)(q_A p_B - D)$$

[You can work on the proof in your spare time.]

If two genes are in linkage disequilibrium, certain alleles of each gene are inherited together more often than would be expected by chance. This may be due to actual genetic linkage, i.e., the genes are closely located on the same chromosome. Or it could be due to some form of functional interaction, where some combinations of alleles at the two loci affect the viability of potential offspring. It should be noted that an observed non-random association of alleles/genotypes need not be caused by their chromosomal location. Any of the evolutionary forces (mutation, random genetic drift, selection and gene flow) can, at least temporarily, cause such associations.

Recombination

Let us consider the double heterozygote A1B1/A2B2. The gametes produced by this genotype are of four types:

Type 1: A1B1   non-recombinant   with frequency (1-c)/2
Type 2: A1B2   recombinant       with frequency c/2
Type 3: A2B1   recombinant       with frequency c/2
Type 4: A2B2   non-recombinant   with frequency (1-c)/2

Gametic types 1 and 4 are called non-recombinants because their alleles are associated in the same manner as in the previous generation. Gametic types 2 and 3 are known as recombinants because their alleles are associated differently than in the previous generation. As a result of Mendelian segregation, f(A1B1) = f(A2B2) and f(A1B2) = f(A2B1). However, f(A1B2) + f(A2B1) does not have to be equal to f(A1B1) + f(A2B2). The proportion of recombinant gametes produced by the double heterozygote is called the recombination fraction, c, and the proportion of non-recombinant gametes is 1 − c. The recombination fraction between genes depends on whether they are on the same chromosome, and also on the physical distance between them.
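The two ideas in this section, D = ru − st and the gamete output of a coupling double heterozygote, can be sketched as follows. The function names and the numerical values (pA = 0.6, pB = 0.3, D = 0.05, c = 0.1) are assumptions made here purely for illustration.

```python
def gametic_disequilibrium(r, s, t, u):
    """D = ru - st from the observed frequencies of A1B1, A1B2, A2B1, A2B2."""
    return r * u - s * t


def gametes_from_allele_frequencies(p_a, p_b, d):
    """Gametic frequencies (r, s, t, u) implied by allele frequencies and D."""
    q_a, q_b = 1.0 - p_a, 1.0 - p_b
    return (p_a * p_b + d, p_a * q_b - d, q_a * p_b - d, q_a * q_b + d)


def coupling_heterozygote_gametes(c):
    """Gamete proportions produced by A1B1/A2B2 with recombination fraction c."""
    return {"A1B1": (1 - c) / 2, "A1B2": c / 2, "A2B1": c / 2, "A2B2": (1 - c) / 2}


# Illustrative (assumed) values
r, s, t, u = gametes_from_allele_frequencies(p_a=0.6, p_b=0.3, d=0.05)
d_recovered = gametic_disequilibrium(r, s, t, u)
gametes = coupling_heterozygote_gametes(c=0.1)
print(f"r={r:.2f} s={s:.2f} t={t:.2f} u={u:.2f} D={d_recovered:.3f}")
print(gametes)
```

Building the gametic frequencies from D and then recovering D with ru − st is a numerical version of the proof left as an exercise in the text.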
During meiosis, the four chromatids (carrying the two genes) align. The two inner chromatids can undergo breakage and exchange of parts (recombination) with each other. Thus, only 50% (0.5) of the chromatids can undergo recombination, and the maximum recombination rate is cmax = 0.5. For genes on different chromosomes, or far apart on the same chromosome, the recombination fraction is c = 0.5, as the four gametic types are produced in equal frequency. Genes that have c < 0.5 must necessarily be on the same chromosome, and such genes are said to be linked. When c = 0, the two genes are so close to each other that a break between them almost never happens, and they are transmitted together as "one super gene".

Gametic disequilibrium and frequency of gamete change over time

The gametic disequilibrium changes from one generation to the next. Let the frequencies of A1B1, A1B2, A2B1 and A2B2 be r, s, t and u, respectively. Now, let us construct the gametic frequencies of the offspring.

Proportion among gametes

Genotype      A1B1       A1B2       A2B1       A2B2
A1B1/A1B1     1          0          0          0
A1B1/A1B2     ½          ½          0          0
A1B1/A2B1     ½          0          ½          0
A1B1/A2B2     ½(1-c)     ½c         ½c         ½(1-c)
A1B2/A1B2     0          1          0          0
A1B2/A2B1     ½c         ½(1-c)     ½(1-c)     ½c
A1B2/A2B2     0          ½          0          ½
A2B1/A2B1     0          0          1          0
A2B1/A2B2     0          0          ½          ½
A2B2/A2B2     0          0          0          1

There are ten different two-locus genotypes; therefore a full mating table would take 100 rows. Assuming Hardy-Weinberg equilibrium, we can instead calculate the frequency with which any one genotype will produce a particular gamete.
Genotypes and the frequency of their progeny gametes

Genotype      Frequency   A1B1        A1B2        A2B1        A2B2
A1B1/A1B1     r²          r²
A1B1/A1B2     2rs         rs          rs
A1B1/A2B1     2rt         rt                      rt
A1B1/A2B2     2ru         (1-c)ru     (c)ru       (c)ru       (1-c)ru
A1B2/A1B2     s²                      s²
A1B2/A2B1     2st         (c)st       (1-c)st     (1-c)st     (c)st
A1B2/A2B2     2su                     su                      su
A2B1/A2B1     t²                                  t²
A2B1/A2B2     2tu                                 tu          tu
A2B2/A2B2     u²                                              u²
Total         1           r'          s'          t'          u'

Summing each column, the frequencies of the four gametes after one generation of random mating are:

$$r' = r - cD_0, \quad s' = s + cD_0, \quad t' = t + cD_0, \quad u' = u - cD_0$$

where D0 is the LD in the preceding generation. Then:

$$D_1 = r'u' - s't' = (r - cD_0)(u - cD_0) - (s + cD_0)(t + cD_0) = (1 - c)D_0$$

This recursive relationship leads to the general relationship:

$$D_t = D_0(1 - c)^t$$

where Dt is the value of D in generation t. The LD decays each generation at a rate determined by the degree of recombination. The maximum value of D (+0.25) occurs when there are only coupling gametes (r = u = 0.5). The minimum value of D (−0.25) occurs when there are only repulsive gametes (s = t = 0.5). Thus, the value of D varies from −0.25 to +0.25. If there is free recombination between two loci (either on different chromosomes or far apart on the same chromosome, so that c = ½), D would be practically eliminated in about 7 generations (starting from D0 = 0.25, D7 = 0.00195). However, if c is much less than 0.5, e.g. 0.05, the decay in disequilibrium will take a substantial period of time.

A major problem with D is that its maximum value changes as a function of the allele frequencies at the two loci. As a result, Lewontin (1964) proposed standardizing D to its maximum possible value:

$$D' = \frac{D}{D_{max}}$$

Dmax is equal to the lesser of pAqB or pBqA if D is positive, or the lesser of pApB or qAqB if D is negative. D' varies between −1 and 1 regardless of the allele frequencies at the two loci, and it provides a metric for comparing LD with the maximum value it could take.
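The decay relationship Dt = D0(1 − c)^t can be explored with a short script. This is a sketch only; the function names and the threshold of 0.002 used in the usage lines are choices made here, not values from the notes.

```python
def ld_after(d0, c, t):
    """Gametic disequilibrium after t generations of random mating."""
    return d0 * (1.0 - c) ** t


def generations_until(d0, c, target):
    """Number of whole generations until |D| first falls to target or below.

    Found by stepping the recursion D -> (1 - c) * D one generation at a time.
    """
    if not 0.0 < c <= 1.0 or target <= 0.0:
        raise ValueError("need 0 < c <= 1 and target > 0")
    d, t = d0, 0
    while abs(d) > target:
        d *= 1.0 - c
        t += 1
    return t


# Free recombination (c = 0.5) starting from the maximum D0 = 0.25
d7 = ld_after(0.25, 0.5, 7)
print(f"D7 = {d7:.5f}")
print("generations to reach 0.002 with c = 0.50:", generations_until(0.25, 0.5, 0.002))
print("generations to reach 0.002 with c = 0.05:", generations_until(0.25, 0.05, 0.002))
```

Comparing the two calls shows how much longer disequilibrium persists between tightly linked loci.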
To determine how long it takes for D to decay to a given value D*, the recursive equation for Dt can be solved for the number of generations, t:

$$t = \frac{\ln(D^*/D_0)}{\ln(1 - c)}$$

When c = 0.1, it will take 6.58 and 21.85 generations for half and 90% of the LD, respectively, to disappear; for c = 0.05, it will take 13.51 and 44.89 generations, respectively.

The correlation coefficient, r, is also used as a measure of LD:

$$r^2 = \frac{D^2}{p_A p_B q_A q_B}$$

where r is the square root of the above expression. (This r should not be confused with the gametic frequency of A1B1 used earlier.) When the allele frequencies are the same at both loci, r² ranges from 0 to 1. When the allele frequencies differ between the two loci, the maximum values of both r² and r are somewhat smaller. The value of the chi-square, X², is numerically equal to r²N, where N is the total number of chromosomes examined. The biological meaning of r is that it is the correlation between alleles present on the same chromosome.

APPLICATION

Originally, LD was defined in terms of gametic frequencies because that allowed for the possibility that the loci are on different chromosomes. However, the usual application now is to loci on the same chromosome. In that case, the allele pair AB is a haplotype, and PAB is the observed haplotype frequency. DAB is estimated from the allele and haplotype frequencies in the sample:

$$D_{AB} = P_{AB} - P_A P_B$$

The quantity DAB is the coefficient of linkage disequilibrium defined for a specific pair of alleles, A and B, and does not depend on how many other alleles are present at the two loci. Each pair of alleles has its own D. The values for different pairs of alleles are constrained by the fact that the allele frequencies at each locus, and the haplotype frequencies, have to add up to 1. If both loci have two alleles, e.g. SNPs, the constraint is strong enough that only one value of D is needed to characterize LD between those loci, and DAB = −DAb = −DaB = Dab, where a and b are the other alleles. In this case, D is used without a subscript.
The sign of D is arbitrary and depends on which pair of alleles one starts with.

Higher-order disequilibria: Disequilibria can also be considered for alleles at three or more loci. For alleles at three loci (A, B, and C), the third-order coefficient is:

$$D_{ABC} = P_{ABC} - P_A D_{BC} - P_B D_{AC} - P_C D_{AB} - P_A P_B P_C$$

where $D_{AB}$, $D_{BC}$ and $D_{AC}$ are the pairwise disequilibrium coefficients. $D_{ABC}$ can be viewed as analogous to the three-way interaction term in an analysis of variance, and can be interpreted as the non-independence among these alleles that is not accounted for by the pairwise coefficients.

Another measure is $\partial_A$, defined as:

$$\partial_A = p_A + \frac{D}{p_B}$$

It is the conditional probability that a chromosome carries an A allele, given that it carries a B allele. It is useful for characterizing the extent to which a particular allele is associated with a genetic disease.

Estimating and testing significance of Linkage Disequilibrium

For most populations, the only information available is the frequency distribution of multi-locus genotypes. The gametic composition of most zygotes can be resolved from the genotype (e.g. an A1A2B1B1 individual must come from A1B1 and A2B1 gametes), but double heterozygotes, which can come from the union of A1B1 and A2B2 gametes or of A1B2 and A2B1 gametes, cannot be resolved definitively. Assuming random mating, it is not necessary to discriminate between coupling and repulsion heterozygotes. In this case, the unbiased estimator of D is given by

$$\hat{D}_{A_1B_1} = \frac{N}{N-1}\left[\frac{4N_{A_1A_1B_1B_1} + 2\left(N_{A_1A_1B_1B_2} + N_{A_1A_2B_1B_1}\right) + N_{A_1A_2B_1B_2}}{2N} - 2\hat{p}_{A_1}\hat{p}_{B_1}\right]$$

where N is the total sample size, the terms in the numerator are the observed numbers of the four genotypes, and $\hat{p}_{A_1}$ and $\hat{p}_{B_1}$ are estimates of allele frequency.
Examples of LD

Observed numbers of two-locus genotypes:

         A1A1   A1A2   A2A2   Total
B1B1       40     10      4      54
B1B2       60     48     14     122
B2B2       28     36     26      90
Total     128     94     44     266

A locus: $P_A = 128/266 = 0.4812$; $H_A = 94/266 = 0.3534$; $Q_A = 44/266 = 0.1654$; $p_A = 0.4812 + ½(0.3534) = 0.6579$; $q_A = 0.1654 + ½(0.3534) = 0.3421$

B locus: $P_B = 54/266 = 0.2030$; $H_B = 122/266 = 0.4586$; $Q_B = 90/266 = 0.3383$; $p_B = 0.2030 + ½(0.4586) = 0.4323$; $q_B = 0.3383 + ½(0.4586) = 0.5677$

$$\hat{D}_0 = \frac{266}{266 - 1}\left[\frac{4(40) + 2(60 + 10) + 48}{2(266)} - 2(0.6579)(0.4323)\right] = 0.0856$$

What does this mean? Since $\hat{D}$ is positive, the maximum value of D is the lesser of $q_A p_B$ or $p_A q_B$. Since $q_A p_B = 0.3421 \times 0.4323 = 0.1479$ and $p_A q_B = 0.6579 \times 0.5677 = 0.3735$, we choose the former. Therefore,

$$D' = \frac{D}{D_{max}} = \frac{0.0856}{0.1479} = 0.5790$$

This tells us that $\hat{D}$ is about 57.90% of its maximum value. With a given recombination rate, c, the value of $\hat{D}$ will change over time.

$$r^2 = \frac{D^2}{p_A p_B q_A q_B} = \frac{0.0856^2}{0.6579 \times 0.4323 \times 0.3421 \times 0.5677} = 0.1327$$

$$X^2 = r^2 N = 0.1327 \times 266 = 35.2868 \quad \text{(using the unrounded } r^2 = 0.13266\text{)}$$

There are 4 chromosomal types, and since we estimated two allele frequencies from the data, the degrees of freedom = 4 - 1 - 2 = 1. Since 35.2868 is greater than the tabulated $X^2$ value at p = 0.05 with 1 df (= 3.84), we conclude that the gametic types are not in linkage equilibrium.
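The worked example above can be reproduced with a short script. This is a sketch under stated assumptions: the function name `ld_from_genotype_counts` and the row/column layout of `counts` are illustrative choices, and the $D_{max}$ step only handles the positive-D case that arises here.

```python
# Rows: B1B1, B1B2, B2B2; columns: A1A1, A1A2, A2A2 (counts from the example).
counts = [
    [40, 10, 4],
    [60, 48, 14],
    [28, 36, 26],
]


def ld_from_genotype_counts(counts):
    """Estimate D, D' and r^2 from a 3x3 table of two-locus genotype counts."""
    n = sum(sum(row) for row in counts)
    a_totals = [sum(row[j] for row in counts) for j in range(3)]  # A1A1, A1A2, A2A2
    b_totals = [sum(row) for row in counts]                       # B1B1, B1B2, B2B2
    p_a = (a_totals[0] + 0.5 * a_totals[1]) / n
    p_b = (b_totals[0] + 0.5 * b_totals[1]) / n
    q_a, q_b = 1.0 - p_a, 1.0 - p_b
    # 4*N(A1A1B1B1) + 2*(N(A1A1B1B2) + N(A1A2B1B1)) + N(A1A2B1B2)
    numerator = 4 * counts[0][0] + 2 * (counts[1][0] + counts[0][1]) + counts[1][1]
    d_hat = n / (n - 1) * (numerator / (2 * n) - 2 * p_a * p_b)
    if d_hat <= 0:
        raise ValueError("this sketch only standardizes a positive D")
    d_max = min(p_a * q_b, q_a * p_b)   # D is positive in this example
    r2 = d_hat ** 2 / (p_a * q_a * p_b * q_b)
    return {"n": n, "p_a": p_a, "p_b": p_b, "d": d_hat,
            "d_prime": d_hat / d_max, "r2": r2, "chi_square": r2 * n}


result = ld_from_genotype_counts(counts)
for name, value in result.items():
    print(name, round(value, 4))
```

Because the script carries unrounded allele frequencies, its last decimal places can differ slightly from the hand calculation, which rounds at each step.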
LD with SNP data

Without considering the distance between two polymorphic SNPs, let's visualize the following on bovine chromosome 1:

SNP1 SNP2
AGGT CCT…………..GATT CAA
AGGT CCT…………..GATT CAA

SNP1: Allele 1 = G, frequency $p_A$; Allele 2 = C, frequency $q_A$
SNP2: Allele 1 = A, frequency $p_B$; Allele 2 = T, frequency $q_B$

Combination of SNPs into haplotypes:

              SNP2 allele A   SNP2 allele T
SNP1 allele G      GA              GT
SNP1 allele C      CA              CT

Haplotype   Expected frequency   Observed frequency
GA          pA pB                r = pA pB + D
GT          pA qB                s = pA qB - D
CA          qA pB                t = qA pB - D
CT          qA qB                u = qA qB + D

Let's consider some SNP data from 1,000 bulls: GA = 280; GT = 300; CA = 75; CT = 345.

Haplotype   Observed number   Observed frequency   Expected frequency
GA          280               r = 0.2800           0.58 x 0.355 = 0.2059
GT          300               s = 0.3000           0.58 x 0.645 = 0.3741
CA           75               t = 0.0750           0.42 x 0.355 = 0.1491
CT          345               u = 0.3450           0.42 x 0.645 = 0.2709

Allele frequencies: G: $p_A = 0.580$; C: $q_A = 0.420$; A: $p_B = 0.355$; T: $q_B = 0.645$

$$D_0 = ru - st = (0.28 \times 0.345) - (0.30 \times 0.075) = 0.0741$$

Alternatively, $D_{GA}$ can also be calculated as:

$$D_{GA} = r - p(G) \times p(A) = 0.2800 - 0.2059 = 0.0741$$

The gametic frequencies in a population of 1,000 chickens for the naked neck (Na/na) and dominant I (I/i) loci are as follows:

Gamete   Frequency
Na-I     0.180 (r)
Na-i     0.707 (s)
na-I     0.061 (t)
na-i     0.052 (u)

Allele frequencies
f(Na) = f(Na-I) + f(Na-i) = 0.180 + 0.707 = 0.887 = $p_A$
f(na) = f(na-I) + f(na-i) = 0.061 + 0.052 = 0.113 = $q_A$
f(Na) + f(na) = 0.887 + 0.113 = 1.000
f(I) = f(Na-I) + f(na-I) = 0.180 + 0.061 = 0.241 = $p_B$
f(i) = f(Na-i) + f(na-i) = 0.707 + 0.052 = 0.759 = $q_B$
f(I) + f(i) = 0.241 + 0.759 = 1.000

Expected gametic frequencies under linkage equilibrium
f(Na-I) = f(Na) x f(I) = 0.887 x 0.241 = 0.2138
f(Na-i) = f(Na) x f(i) = 0.887 x 0.759 = 0.6732
f(na-I) = f(na) x f(I) = 0.113 x 0.241 = 0.0272
f(na-i) = f(na) x f(i) = 0.113 x 0.759 = 0.0858

$$D_0 = ru - st = (0.180 \times 0.052) - (0.707 \times 0.061) = -0.0338$$

Observed frequency = Expected frequency + $D_0$
Observed frequency of Na-I = [f(Na) x f(I)] + $D_0$ = 0.2138 - 0.0338 = 0.1800

The decay in LD under different recombination rates is shown in Fig 7.
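Both haplotype examples above follow the same arithmetic, which the sketch below wraps in one helper. The function name `haplotype_ld` and the returned dictionary keys are illustrative assumptions; the inputs are the four observed haplotype (gamete) frequencies r, s, t and u in the order used in the text.

```python
def haplotype_ld(r, s, t, u):
    """Allele frequencies, D and equilibrium expectations from haplotype frequencies.

    r, s, t, u are the observed frequencies of the AB, Ab, aB and ab haplotypes.
    """
    total = r + s + t + u
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_a = r + s            # frequency of the first allele at locus A
    p_b = r + t            # frequency of the first allele at locus B
    d = r * u - s * t      # linkage disequilibrium
    expected = {
        "AB": p_a * p_b,
        "Ab": p_a * (1 - p_b),
        "aB": (1 - p_a) * p_b,
        "ab": (1 - p_a) * (1 - p_b),
    }
    return {"p_a": p_a, "p_b": p_b, "d": d, "expected": expected}


bulls = haplotype_ld(0.280, 0.300, 0.075, 0.345)      # GA, GT, CA, CT
chickens = haplotype_ld(0.180, 0.707, 0.061, 0.052)   # Na-I, Na-i, na-I, na-i

print(round(bulls["d"], 4), round(chickens["d"], 4))
# Observed AB frequency equals the equilibrium expectation plus D.
print(round(chickens["expected"]["AB"] + chickens["d"], 4))
```

The last line illustrates the identity "observed = expected + D" for the Na-I gamete.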
When there is no linkage (c = ½), LD will be almost zero by generation 7. However, it takes much longer for LD to decay when the recombination rate is closer to 0.

Since $\hat{D}$ is negative, the maximum value of D is the lesser of $p_A p_B$ or $q_A q_B$. Since $p_A p_B$ = f(Na) x f(I) = 0.887 x 0.241 = 0.2138 and $q_A q_B$ = f(na) x f(i) = 0.113 x 0.759 = 0.0858, we choose the latter. Therefore,

$$D' = \frac{D}{D_{max}} = \frac{-0.03377}{0.08577} = -0.3937$$

This tells us that $\hat{D}$ is about 39.37% of its maximum (absolute) value.

The observed frequency at generation t = Expected frequency + $D_t$, where

$$D_t = D_0(1 - c)^t$$

and c is the recombination rate. Assuming c = 0.1, at generation 2, $D_2 = -0.0274$. The observed frequency of Na-I will be 0.2138 - 0.0274 = 0.1864.

Now we can test whether $D_0$ is significantly different from zero using the Chi-square test.

Null hypothesis: The observed gametic frequencies do not deviate from the expected gametic frequencies.

Since $X^2$ must be computed from counts rather than from frequencies or fractions, we use the observed and expected numbers (frequency x 1,000).

$$X^2 = \frac{(180 - 213.8)^2}{213.8} + \frac{(707 - 673.2)^2}{673.2} + \frac{(61 - 27.2)^2}{27.2} + \frac{(52 - 85.8)^2}{85.8} = 62.3571$$

Degrees of freedom = 4 - 1 - 1 (for estimating f(Na) from the data) - 1 (for estimating f(I) from the data) = 1. The tabulated $X^2$ with 1 df at p = 0.05 is 3.84. We can reject the null hypothesis and conclude that the observed gametic frequencies are not in equilibrium, i.e. they are in linkage disequilibrium.

Population genetics of LD
Linkage disequilibrium is affected by the following:
Selection (both natural and artificial)
Genetic drift
Population subdivision and bottlenecks
Inbreeding, inversion and gene conversion

Applications of LD
Mutation, gene mapping, QTL studies, genome breeding value estimation
Detecting natural selection

Population structure and Gene flow

So far we have assumed that a population is 'homogeneous', so that the characteristics of the subpopulations sampled from the population would be identical. This assumption may not be true.
The distribution of individuals and the gene (allele) flow connections between different subpopulations can be important in evolution. By population structure, a population geneticist means that, instead of being a single, simple population, the population may have substructure, i.e., differences in genetic variation among the subpopulations that arise for different evolutionary reasons (genetic drift, nonrandom mating, selection, etc.). The overall population of subpopulations is referred to as the total population (T). The individual components of the total population are referred to as subpopulations (S), local populations or demes. In many real populations, there may not be obvious structure, and the population is continuous. However, even in effectively continuous populations, different areas or regions can have different allele frequencies because mating in the total population is usually nonrandom. In humans within a country sharing the same language, there are most often regional language differences suggesting substructure, but it is always difficult to find the exact boundary where the changeover occurs. Such a population is structured, but continuous in space. Population structure can therefore be recognized when subpopulations deviate from Hardy-Weinberg proportions.

Reduction in heterozygosity is one of the major consequences of population substructure. The deviation from the expected heterozygote frequency in a population is called inbreeding, F. The inbreeding coefficient, F, compares the actual heterozygote frequency with the heterozygote frequency expected under Hardy-Weinberg equilibrium. The heterozygosity ($H_E$) under equilibrium is the frequency of the heterozygotes (2pq). With inbreeding, $H_E$ is reduced by a factor $1 - F$. Therefore, the observed frequency of heterozygotes ($H_0$) becomes $2pq(1 - F)$.

$$F = \frac{H_E - H_0}{H_E} = 1 - \frac{H_0}{H_E}$$

The reduction in heterozygote frequency implies an increase in the frequency of homozygotes. The reduction in heterozygote frequency is divided equally among the two homozygotes.
The change in heterozygote frequency is given as

$$H_E - H_0 = 2pq - 2pq(1 - F) = 2pq - [2pq - 2pqF] = 2pqF$$

This implies that the two homozygotes each have their frequency increased by $\frac{2pqF}{2} = pqF$. The reduction in heterozygotes is divided equally between the two homozygotes because each heterozygous genotype carries one copy of each of the two alleles. The expected and observed genotypic frequencies are therefore:

Genotypic frequencies under inbreeding

                              A1A1          A1A2          A2A2
Expected genotype frequency   p^2           2pq           q^2
Observed genotype frequency   p^2 + pqF     2pq(1 - F)    q^2 + pqF

If a gene has multiple alleles, $A_1, A_2, \ldots, A_n$, with respective frequencies $p_1, p_2, \ldots, p_n$, where $p_1 + p_2 + \cdots + p_n = 1$, and the inbreeding coefficient is F, then

$$\begin{cases} \text{frequency of } A_iA_i = p_i^2(1 - F) + p_iF \\ \text{frequency of } A_iA_j = 2p_ip_j(1 - F) \end{cases}$$

F coefficients

If individuals mate within subpopulations, they are more likely to mate with related individuals than if they mated randomly over the entire population. Sewall Wright provided an approach to partitioning the genetic variation in subpopulations that gives a natural description of differentiation. If $H_T$ and $H_S$ are the heterozygosity of the total population and the average heterozygosity of the subpopulations, respectively, Wright's fixation index, $F_{ST}$, measures the average change in heterozygosity in the subpopulations relative to the total heterozygosity:

$$F_{ST} = \frac{H_T - H_S}{H_T} = 1 - \frac{H_S}{H_T}$$

If individuals are mated at random within the whole population, then $H_T = 2pq$. On the other hand, if there is spatial structure and individuals mate within subpopulations, then the frequency of heterozygotes depends on the allele frequency in that subpopulation:

$$H_i = 2p_iq_i \quad \text{for subpopulation } i$$

If there are a total of k subpopulations, then $H_S$ is the average over subpopulations (weighted by subpopulation size when the sizes differ):

$$H_S = \frac{1}{k}\sum_{i=1}^{k} 2p_iq_i$$

Within each subpopulation, there can be a deviation from the heterozygote frequency expected within that subpopulation.
Using the same logic,

$$F_{IS} = \frac{H_S - H_I}{H_S} = 1 - \frac{H_I}{H_S}$$

where $F_{IS}$ is a measure of the deviation from the Hardy-Weinberg proportion of heterozygotes expected within subpopulations. Similarly, $F_{IT}$ measures the deviation from the Hardy-Weinberg proportion of heterozygotes expected within the whole population:

$$F_{IT} = \frac{H_T - H_I}{H_T} = 1 - \frac{H_I}{H_T}$$

The heterozygosity $H_I$ within subpopulations is calculated from the observed heterozygote frequency within the subpopulations. Consequently,

$$1 - F_{IS} = \frac{H_I}{H_S}; \qquad 1 - F_{IT} = \frac{H_I}{H_T} \qquad \text{and} \qquad 1 - F_{ST} = \frac{H_S}{H_T}$$

Since $H_I = H_S(1 - F_{IS})$,

$$1 - F_{IT} = \frac{H_S(1 - F_{IS})}{H_T} \qquad \text{and} \qquad \frac{H_S}{H_T} = 1 - F_{ST}$$

so that

$$1 - F_{IT} = (1 - F_{ST})(1 - F_{IS})$$

If individuals are mating completely at random over the entire population, then there will be no local variation in allele frequency and each subpopulation will have the same expected heterozygosity as the total population. In that case $F_{ST} = 0$ and there is no differentiation among subpopulations. At the other extreme, if each subpopulation is completely isolated and alleles have become fixed within each subpopulation, then there is no heterozygosity within the subpopulations. In that case $F_{ST} = 1$ and there is maximum differentiation among subpopulations.

Practical example: A population of 1,600 individuals was divided into three subpopulations and genotyped for the gene responsible for juicy meat in a delicacy goat breed in Yourland.
Observed numbers

                    AA     Aa     aa     Total
Subpopulation 1     125    250    125      500
Subpopulation 2      55     30     15      100
Subpopulation 3      80    440    480    1,000
Total population    260    720    620    1,600

Subpopulation 1: $P_1 = 125/500 = 0.25$; $H_1 = 250/500 = 0.50$; $Q_1 = 125/500 = 0.25$; $p_1 = P_1 + ½H_1 = 0.5$; $q_1 = 0.5$

Subpopulation 2: $P_2 = 55/100 = 0.55$; $H_2 = 30/100 = 0.30$; $Q_2 = 15/100 = 0.15$; $p_2 = P_2 + ½H_2 = 0.7$; $q_2 = 0.3$

Subpopulation 3: $P_3 = 80/1000 = 0.08$; $H_3 = 440/1000 = 0.44$; $Q_3 = 480/1000 = 0.48$; $p_3 = P_3 + ½H_3 = 0.3$; $q_3 = 0.7$

Total population: $P_{T0} = 260/1600 = 0.1625$; $H_{T0} = 720/1600 = 0.45$; $Q_{T0} = 620/1600 = 0.3875$; $p_{T0} = P_{T0} + ½H_{T0} = 0.3875$; $q_{T0} = 0.6125$

Expected numbers

                    AA          Aa          aa          Total
Subpopulation 1     125         250         125           500
Subpopulation 2      49          42           9           100
Subpopulation 3      90         420         490         1,000
Total population    240.2496    759.5008    600.2496    1,600

Expected frequency:
$AA_1 = p_1^2 = 0.5^2 = 0.25$; $Aa_1 = 2p_1q_1 = 2 \times 0.5 \times 0.5 = 0.50$; $aa_1 = q_1^2 = 0.5^2 = 0.25$
$AA_2 = p_2^2 = 0.7^2 = 0.49$; $Aa_2 = 2p_2q_2 = 2 \times 0.7 \times 0.3 = 0.42$; $aa_2 = q_2^2 = 0.3^2 = 0.09$
$AA_3 = p_3^2 = 0.3^2 = 0.09$; $Aa_3 = 2p_3q_3 = 2 \times 0.3 \times 0.7 = 0.42$; $aa_3 = q_3^2 = 0.7^2 = 0.49$
$AA_{T0} = p_{T0}^2 = 0.3875^2 = 0.150156$; $Aa_{T0} = 2p_{T0}q_{T0} = 2 \times 0.3875 \times 0.6125 = 0.474688$; $aa_{T0} = q_{T0}^2 = 0.6125^2 = 0.375156$

Inbreeding coefficient in subpopulations and total population

$$F_{s1} = 1 - \frac{H_1}{H_{E1}} = 1 - \frac{H_1}{2p_1q_1} = 1 - \frac{0.50}{0.50} = 0.000$$

$$F_{s2} = 1 - \frac{H_2}{H_{E2}} = 1 - \frac{H_2}{2p_2q_2} = 1 - \frac{0.30}{0.42} = 0.2857$$

$$F_{s3} = 1 - \frac{H_3}{H_{E3}} = 1 - \frac{H_3}{2p_3q_3} = 1 - \frac{0.44}{0.42} = -0.0476$$

$$F_{T0} = 1 - \frac{H_{T0}}{H_{ET0}} = 1 - \frac{H_{T0}}{2p_{T0}q_{T0}} = 1 - \frac{0.450000}{0.474688} = 0.0520$$

In subpopulation 1, the observed heterozygotes are the same as expected.
In subpopulation 2, there are fewer heterozygotes observed than expected.
In subpopulation 3, there are more heterozygotes than expected.

The observed and expected genotypic frequencies in subpopulation 2, with $F_{s2} = 0.2857$ and $pqF = 0.059997$:

                              A1A1                     A1A2                      A2A2
Expected genotype frequency   p^2 = 0.49               2pq = 0.42                q^2 = 0.09
Observed genotype frequency   p^2 + pqF = 0.55 = P2    2pq(1 - F) = 0.30 = H2    q^2 + pqF = 0.15 = Q2

$$H_I = \frac{H_1N_1 + H_2N_2 + H_3N_3}{N} = \frac{0.5 \times 500 + 0.30 \times 100 + 0.44 \times 1000}{1600} = 0.4500$$

$$H_S = \frac{H_{E1}N_1 + H_{E2}N_2 + H_{E3}N_3}{N} = \frac{0.5 \times 500 + 0.42 \times 100 + 0.42 \times 1000}{1600} = 0.445$$

$$H_T = 2p_{T0}q_{T0} = 2 \times 0.3875 \times 0.6125 = 0.474688$$

With $H_T$ rounded to 0.475:

$$F_{IS} = 1 - \frac{H_I}{H_S} = 1 - \frac{0.450}{0.445} = -0.0112$$

$$F_{ST} = 1 - \frac{H_S}{H_T} = 1 - \frac{0.445}{0.475} = 0.0632$$

$$F_{IT} = 1 - \frac{H_I}{H_T} = 1 - \frac{0.450}{0.475} = 0.0526$$

Verification

$$1 - F_{IT} = (1 - F_{ST})(1 - F_{IS})$$
$$(1 - 0.0526) = (1 - 0.0632)(1 - (-0.0112))$$
$$0.9474 \approx 0.9368 \times 1.0112 = 0.9473$$

Some general conclusions
Subpopulation 1 is consistent with Hardy-Weinberg proportions.
Subpopulation 2 has experienced some inbreeding.
Subpopulation 3 may have experienced heterozygote advantage or disassortative mating, since it has more heterozygotes than expected.

Conclusion concerning the overall degree of genetic differentiation ($F_{ST}$)
Subdivision of the population, possibly due to genetic drift, accounts for 6.32% of the total genetic variation. The differentiation led to a deficiency of heterozygotes over the total population.

QUANTITATIVE GENETICS

Genetic decomposition of a locus on the phenotype

The nature of quantitative traits: A quintessential question all quantitative geneticists ask is: how much of the variation in a population with respect to a particular trait is due to genetic causes, and how much is due to environmental factors? The phenotype (P) can be partitioned into a genotypic value (G) and an environmental deviation (E).

$$P = G + E$$

We will focus our attention on the genetic component, G.
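Before moving on, the F-statistics of the goat example in the population-structure section can be reproduced from the genotype counts alone. The sketch below is illustrative (the function name `f_statistics` and the dictionary layout are assumptions). It weights subpopulations by size, as the hand calculation does, but keeps $H_T$ unrounded, so it gives $F_{ST} \approx 0.0625$ and $F_{IT} \approx 0.0520$ rather than the values obtained with $H_T$ rounded to 0.475.

```python
# Genotype counts (AA, Aa, aa) for each subpopulation of the goat example.
subpopulations = {
    "Subpopulation 1": (125, 250, 125),
    "Subpopulation 2": (55, 30, 15),
    "Subpopulation 3": (80, 440, 480),
}


def f_statistics(subpops):
    """Return (F_IS, F_ST, F_IT) using size-weighted heterozygosities."""
    total_n = 0
    observed_het = 0.0     # sum of observed heterozygote counts
    expected_het = 0.0     # sum of 2*p*q*N within each subpopulation
    allele_a_copies = 0.0  # copies of allele A, counted per individual (p * N)
    for n_aa, n_het, n_bb in subps_values(subpops):
        n = n_aa + n_het + n_bb
        p = (n_aa + 0.5 * n_het) / n
        observed_het += n_het
        expected_het += 2 * p * (1 - p) * n
        allele_a_copies += p * n
        total_n += n
    h_i = observed_het / total_n
    h_s = expected_het / total_n
    p_total = allele_a_copies / total_n
    h_t = 2 * p_total * (1 - p_total)
    return 1 - h_i / h_s, 1 - h_s / h_t, 1 - h_i / h_t


def subps_values(subpops):
    """Yield the genotype-count tuples in a stable order."""
    for name in sorted(subpops):
        yield subpops[name]


f_is, f_st, f_it = f_statistics(subpopulations)
print(round(f_is, 4), round(f_st, 4), round(f_it, 4))
```

Whatever rounding is used, the identity $1 - F_{IT} = (1 - F_{ST})(1 - F_{IS})$ holds exactly for the unrounded values, which makes it a convenient check.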
Let's consider a single gene A with two alleles, A1 and A2, combining into the genotypes A1A1, A1A2 and A2A2. Let $a$, $d$ and $-a$ be the arbitrary genotypic values for A1A1, A1A2 and A2A2, respectively. The difference between the two homozygotes is 2a. The value of a is a deviation from 0 (the mid-point), which is the average of the two homozygotes. The heterozygote, A1A2, has a value of d = ak, where k is the degree of dominance. The alleles A1 and A2 behave in a completely additive manner when k = 0. When k = +1, the A1 allele is completely dominant over the A2 allele; when k = -1, the A2 allele is completely dominant over the A1 allele. $k > +1$ means overdominance, and $k < -1$ means underdominance.

Let's look at a data set. The genotypic values of an AluI polymorphic site in the 5'-region of the bovine growth hormone receptor gene for milk fat are as follows:
AluI (-/-): -25, designated A2A2
AluI (+/-): -23, designated A1A2
AluI (+/+): -10, designated A1A1

The midpoint of the two homozygotes = [-25 + (-10)]/2 = -17.5. The value of a = -10 - (-17.5) = 7.5 and d = -23 - (-17.5) = -5.5; k = d/a = -5.5/7.5 = -0.73.

Population mean

Let's estimate the population mean (μ) of N individuals, assuming a single locus with two alleles.

Expression of the population mean

Genotype   Frequency   Genotypic value   Frequency x value
A1A1       p^2         +a                p^2 a
A1A2       2pq         d                 2pqd
A2A2       q^2         -a                -q^2 a

$$\mu = \frac{\sum \text{frequency} \times \text{value}}{\sum \text{frequency}}$$

$$\mu_G = \frac{p^2a - q^2a + 2pqd}{p^2 + 2pq + q^2}$$

The denominator is equal to 1. The numerator can be rewritten as $a(p^2 - q^2) + 2pqd$, and since $p^2 - q^2 = (p + q)(p - q) = p - q$, the population mean can be written as:

$$\mu_G = a(p - q) + 2pqd$$

The homozygotes contribute a(p - q) and the heterozygotes contribute 2pqd to the population mean. From Fig 9, the population mean depends on allele frequency. The population mean decreases with increasing frequency of the unfavorable allele (Fig 9a). The population mean increases with increasing frequency of the favorable allele (Fig 9b).
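The calculation of a, d and k from the three genotypic values, and of the population mean from them, can be sketched as follows. The function names are illustrative assumptions, and the allele frequency p = 0.5 used at the end is hypothetical (the notes do not give one for the AluI example). Note that the mean returned is a deviation from the midpoint of the two homozygotes.

```python
def additive_dominance(g_a1a1, g_a1a2, g_a2a2):
    """Return (midpoint, a, d, k) from the three genotypic values."""
    midpoint = (g_a1a1 + g_a2a2) / 2.0
    a = g_a1a1 - midpoint        # half the difference between homozygotes
    d = g_a1a2 - midpoint        # heterozygote deviation from the midpoint
    k = d / a                    # degree of dominance
    return midpoint, a, d, k


def population_mean(a, d, p):
    """mu_G = a(p - q) + 2pqd, expressed as a deviation from the midpoint."""
    q = 1.0 - p
    return a * (p - q) + 2.0 * p * q * d


# AluI example: A1A1 = -10, A1A2 = -23, A2A2 = -25.
midpoint, a, d, k = additive_dominance(-10.0, -23.0, -25.0)
print(midpoint, a, d, round(k, 2))

# Hypothetical allele frequency p(A1) = 0.5, chosen only for illustration.
mu = population_mean(a, d, p=0.5)
print(mu, midpoint + mu)   # deviation from midpoint, and mean on the original scale
```

Adding the midpoint back to $\mu_G$ converts the mean to the original measurement scale.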
Population mean under additivity (k = 0): We have already established that d = ka; therefore, when k = 0, d = 0.

$$\mu_G = a(p - q)$$

Since p = 1 - q,

$$\mu_G = a(1 - q - q) = a(1 - 2q)$$

Population mean under complete dominance (k = 1): Under complete dominance, k = 1, which means d = a.

$$\mu_G = a(p - q) + 2pqa$$
$$\mu_G = a(1 - q - q) + 2aq(1 - q)$$
$$\mu_G = a - 2aq + 2aq - 2aq^2$$
$$\mu_G = a(1 - 2q^2)$$

Genetic Model

The genotypic value of an individual can be written in terms of the genetic decomposition of the genotype.

$$G = A + D + I$$

The genotypic value equals the sum of the breeding value, A, the dominance deviation, D, and the epistatic deviation, I. For simplicity, we will ignore the epistatic deviation and concentrate on the breeding (additive) value and the dominance deviation.

$$G = A + D$$

Genotypic value, G

The genotypic value can be written as a deviation from the population mean.

$$G_{A_1A_1} = a - \mu_G, \qquad G_{A_1A_2} = d - \mu_G, \qquad G_{A_2A_2} = -a - \mu_G$$

$$G_{A_1A_1} = a - [a(p - q) + 2pqd] = a - pa + qa - 2pqd = a(1 - p + q) - 2pqd = a(1 - 1 + q + q) - 2pqd$$

$$G_{A_1A_1} = 2q(a - dp)$$

Similarly, $G_{A_1A_2} = a(q - p) + d(1 - 2pq)$ and $G_{A_2A_2} = -2p(a + qd)$.

BREEDING (Additive) VALUES (A)

An individual's breeding value can be said to be the sum of the additive effects of the individual's alleles. The concept of additive effects arises from the fact that parents pass on their alleles, and not their genotypes, to their progeny. Therefore, the value of an individual judged by the mean value of its progeny is called the individual's breeding value. The breeding value of an individual at a locus is defined as the sum of the additive effects of the alleles at the locus.

Allelic value of A1 ($\alpha_1$)

An A1 gamete can combine at random with either A1 or A2 to produce A1A1 with genotypic value +a or A1A2 with genotypic value d.
Taking into account the proportions in which they occur, the allelic value of A1 = pa + qd. The mean deviation of the progeny from the population mean is:

$$pa + qd - \mu_G = pa + qd - [a(p - q) + 2dpq] = q[a + d(q - p)]$$

[Note: p + q = 1; and 1 - 2p = p + q - 2p = q - p]

Allelic value of A2 ($\alpha_2$)

An A2 gamete can combine at random with either A2 or A1 to produce A2A2 with genotypic value -a or A1A2 with genotypic value d. Taking into account the proportions in which they occur, the allelic value of A2 = -qa + pd. The mean deviation of the progeny from the population mean is:

$$-qa + pd - \mu_G = -qa + pd - [a(p - q) + 2dpq] = -p[a + d(q - p)]$$

When there are only two alleles at a locus, it is more convenient to express their additive effects in terms of the average effect of allele substitution.

$$\alpha_1 = q[a + d(q - p)]$$
$$\alpha_2 = -p[a + d(q - p)]$$

The effect of substituting one allele with the other is $\alpha = \alpha_1 - \alpha_2$; that is, the average change in the genotypic value when an A2 allele is replaced by an A1 allele.

$$\alpha = \alpha_1 - \alpha_2 = qa + dq^2 - dpq + pa + dpq - dp^2 = qa + pa + dq^2 - dp^2$$
$$\alpha = a(p + q) + d(q^2 - p^2)$$

Note that $p + q = 1$ and $(q^2 - p^2) = (q + p)(q - p)$, so

$$\alpha = a + d(q - p)$$

An individual's breeding value, A, is the sum of all additive effects of its alleles. When mating is random, the breeding value of an individual is twice the expected mean deviation of its progeny from the population mean. The deviation is multiplied by two since only one half of the parental alleles are transmitted to each progeny. Therefore, we can estimate the breeding value of an individual by mating it to random individuals from the population and taking twice the deviation of its offspring mean from the population mean. Breeding values can be estimated under several scenarios.
The breeding values are:

$$A_{ij} = \begin{cases} 2\alpha_1 & \text{if genotype is } A_1A_1 \\ \alpha_1 + \alpha_2 & \text{if genotype is } A_1A_2 \\ 2\alpha_2 & \text{if genotype is } A_2A_2 \end{cases}$$

Mean breeding value: The sum over genotypes of the breeding value multiplied by the genotype frequency gives the mean breeding value.

                        A1A1          A1A2              A2A2
Frequency               p^2           2pq               q^2
Breeding value          2qα           (q - p)α          -2pα
Frequency x value       2p^2 qα       2pq(q - p)α       -2pq^2 α

$$\bar{A} = 2p^2q\alpha + 2pq(q - p)\alpha - 2pq^2\alpha = 2pq\alpha(p + q - p - q) = 0$$

Dominance deviation (D)

From the genetic model, we can calculate the dominance deviation as:

$$D = G - A$$

Since we have already derived both G and A, we can deduce D. Dominance deviations arise from the interaction between alleles at a locus. In the absence of dominance, G = A. Let's write G in terms of α:

$$G_{A_1A_1} = 2q(a - pd), \quad \text{and} \quad \alpha = a + d(q - p) \;\Rightarrow\; a = \alpha - dq + dp$$

$$G_{A_1A_1} = 2qa - 2pqd = 2q(\alpha - dq + dp) - 2pqd = 2q\alpha - 2dq^2 + 2pqd - 2pqd$$

Therefore, $G_{A_1A_1} = 2q(\alpha - qd)$. Similarly, $G_{A_1A_2} = (q - p)\alpha + 2pqd$ and $G_{A_2A_2} = -2p(\alpha + pd)$.

                        A1A1             A1A2                 A2A2
Frequency               p^2              2pq                  q^2
Genotypic value, G      2q(α - qd)       (q - p)α + 2pqd      -2p(α + pd)
Breeding value, A       2qα              (q - p)α             -2pα
Dominance, D = G - A    -2q^2 d          2pqd                 -2p^2 d
Frequency x D           -2p^2 q^2 d      +4p^2 q^2 d          -2p^2 q^2 d

Mean dominance deviation: $-2p^2q^2d + 4p^2q^2d - 2p^2q^2d = 0$

COMPONENTS OF GENETIC VARIATION

Genetics as a subject focuses on variability at several levels. Without variability, there is nothing to study. It is therefore important to quantify variability and partition it into its components. A single locus with two alleles provides us with three genotypes. We can therefore compute the genotypic variation.

Estimation of variation: In general we study variation by estimating the variance.
Variance can be estimated as follows. For values $X_i$ occurring with frequencies $f_i$:

$$\sigma^2 = \frac{\sum f_iX_i^2}{\sum f_i} - \left(\frac{\sum f_iX_i}{\sum f_i}\right)^2 \qquad \text{or, when } \sum f_i = 1, \qquad \sigma^2 = \sum f_iX_i^2 - \mu^2$$

For N individual observations:

$$\sigma^2 = \frac{1}{N}\left[\sum X_i^2 - \frac{\left(\sum X_i\right)^2}{N}\right] \qquad \text{or} \qquad \sigma^2 = \frac{1}{N}\sum (X_i - \mu)^2$$

However, if $X_i^* = X_i - \mu$, then $\sigma_{X^*}^2 = \sum f_iX_i^{*2}$.

GENOTYPIC VARIATION

The genotypic variance, $\sigma_G^2$, can be estimated as:

$$\sigma_G^2 = \sum (f_{ij}G_{ij}^2) - \mu_G^2$$

Since we have already calculated $G_{ij}$ as a deviation from the population mean $\mu$,

$$\sigma_G^2 = \sum (f_{ij}G_{ij}^2)$$

$$\sigma_G^2 = p^2G_{A_1A_1}^2 + 2pqG_{A_1A_2}^2 + q^2G_{A_2A_2}^2$$

$$G_{ij} = \begin{cases} 2q(\alpha - qd) & \text{if genotype is } A_1A_1 \\ (q - p)\alpha + 2pqd & \text{if genotype is } A_1A_2 \\ -2p(\alpha + pd) & \text{if genotype is } A_2A_2 \end{cases}$$

Thus,

$$\sigma_G^2 = p^2[2q(\alpha - qd)]^2 + 2pq[(q - p)\alpha + 2pqd]^2 + q^2[-2p(\alpha + pd)]^2$$

$$\sigma_G^2 = 2pq\alpha^2 + (2pqd)^2$$

Partitioning of the Genetic Variance

Earlier on we defined

$$G = A + D$$

The genetic model contains both the additive and dominance values. The variance of G is:

$$\sigma_G^2 = \sigma_A^2 + \sigma_D^2 + 2Cov_{AD}$$

In a population under Hardy-Weinberg equilibrium (without inbreeding), the covariance between the breeding value and the dominance deviation is zero.

$$Cov_{AD} = \sum (f_{ij}A_{ij}D_{ij}) = [(p^2)(2q\alpha)(-2q^2d)] + [(2pq)((q - p)\alpha)(2pqd)] + [(q^2)(-2p\alpha)(-2p^2d)]$$

$$= -4p^2q^3\alpha d + 4p^2q^2(q - p)\alpha d + 4p^3q^2\alpha d$$

$$Cov_{AD} = 4p^2q^2\alpha d(-q + q - p + p) = 0$$

Therefore, we can drop the covariance from the above model, and

$$\sigma_G^2 = \sigma_A^2 + \sigma_D^2$$

Additive genetic variance, $\sigma_A^2$

We can use the same logic used in calculating the genetic variance to calculate the additive genetic variance.
Since we have already calculated $A_{ij}$ as a deviation from the population mean $\mu$,

$$\sigma_A^2 = \sum (f_{ij}A_{ij}^2)$$

$$A_{ij} = \begin{cases} 2\alpha_1 = 2q\alpha & \text{if genotype is } A_1A_1 \\ \alpha_1 + \alpha_2 = (q - p)\alpha & \text{if genotype is } A_1A_2 \\ 2\alpha_2 = -2p\alpha & \text{if genotype is } A_2A_2 \end{cases}$$

$$\sigma_A^2 = p^2(2q\alpha)^2 + 2pq[(q - p)\alpha]^2 + q^2(-2p\alpha)^2$$

$$\sigma_A^2 = 4p^2q^2\alpha^2 + 2pq(q - p)^2\alpha^2 + 4p^2q^2\alpha^2 = 2pq\alpha^2(2pq + q^2 - 2pq + p^2 + 2pq) = 2pq\alpha^2(p^2 + 2pq + q^2)$$

$$\sigma_A^2 = 2pq\alpha^2$$

Dominance variance, $\sigma_D^2$

We have already calculated $D_{ij}$ as a deviation from the population mean $\mu$; therefore,

$$\sigma_D^2 = \sum (f_{ij}D_{ij}^2)$$

$$D_{ij} = \begin{cases} -2q^2d & \text{if genotype is } A_1A_1 \\ 2pqd & \text{if genotype is } A_1A_2 \\ -2p^2d & \text{if genotype is } A_2A_2 \end{cases}$$

$$\sigma_D^2 = p^2(-2q^2d)^2 + 2pq(2pqd)^2 + q^2(-2p^2d)^2 = 4p^2q^4d^2 + 8p^3q^3d^2 + 4p^4q^2d^2 = 4p^2q^2d^2(q^2 + 2pq + p^2)$$

$$\sigma_D^2 = (2pqd)^2$$

$$\sigma_G^2 = 2pq\alpha^2 + (2pqd)^2$$

Fig 10. The genotypic (VG), additive (VA) and dominance (VD) variances at different allele frequencies.

If there is no dominance (d = 0), the dominance variance $\sigma_D^2 = 0$, resulting in $\sigma_G^2 = \sigma_A^2$. If there is complete dominance (d = a), the additive variance becomes $\sigma_A^2 = 8pq^3a^2$. When p = q = 0.5,

$$\sigma_A^2 = ½a^2 \qquad \text{and} \qquad \sigma_D^2 = ¼d^2$$

Genetic parameter estimates under different allele frequencies

Egg weights of 50, 45 and 30 for A1A1, A1A2 and A2A2 give a midpoint of 40, so the genotypic values are 10, 5 and -10 (a = 10, d = 5).

q     Genotype   Egg weight   G      f      A       f x A     D      f x D
0.1   A1A1       50           10     0.81   1.2     0.972     -0.1   -0.081
0.1   A1A2       45           5      0.18   -4.8    -0.864    0.9    0.162
0.1   A2A2       30           -10    0.01   -10.8   -0.108    -8.1   -0.081
0.5   A1A1       50           10     0.25   10      2.5       -2.5   -0.625
0.5   A1A2       45           5      0.50   0       0         2.5    1.25
0.5   A2A2       30           -10    0.25   -10     -2.5      -2.5   -0.625
0.8   A1A1       50           10     0.04   20.8    0.832     -6.4   -0.256
0.8   A1A2       45           5      0.32   7.8     2.496     1.6    0.512
0.8   A2A2       30           -10    0.64   -5.2    -3.328    -0.4   -0.256

(G = genotypic value, f = genotypic frequency, A = breeding value, D = dominance deviation)

Parameter                               q = 0.1   q = 0.5   q = 0.8
Population mean = a(p - q) + 2pqd       8.9       2.5       -4.4
α = a + d(q - p)                        6         10        13
Additive effect of A1 = qα              0.6       5         10.4
Additive effect of A2 = -pα             -5.4      -5        -2.6
Mean breeding value                     0         0         0
Mean dominance deviation                0         0         0
Additive variance                       6.48      50        54.08
Dominance variance                      0.81      6.25      2.56
Genetic variance                        7.29      56.25     56.64

MOLECULAR GENETICS APPLIED TO ANIMAL BREEDING

GENOME ORGANIZATION

What is a genome?
A genome is an organism's complete set of DNA, including all of its genes. Each genome contains all of the information needed to build and maintain that organism. The genome is made up of the DNA in chromosomes as well as the DNA in mitochondria. The genome contains the instructions, or blueprint, for all activity in an organism. The instructions are written in the four-letter language of DNA, i.e. Adenine, Cytosine, Thymine and Guanine (shortened to A, C, T, and G). Almost every cell in a eukaryotic organism contains a complete copy of these instructions. The genetic instructions are stored in pairs of chromosomes. Each chromosome contains genes, which carry the direct instructions for a cell to make a protein. The genome contains coding sequences (genes) and non-coding sequences of DNA.

The genome contains:
1. STRUCTURAL GENES: DNA segments that code for specific RNAs or proteins. They encode mRNAs, tRNAs, snRNAs, scRNAs, etc.
2. FUNCTIONAL SEQUENCES: Regulatory sequences, which occur as regulatory elements (initiation sites, promoter regions, terminator regions, etc.)
3. NON-FUNCTIONAL SEQUENCES: Introns, repetitive sequences, and all the unknowns

DNA: Double-stranded helical structure.
NUCLEOSOME: DNA is complexed with histones. Each nucleosome consists of eight histone proteins around which the DNA wraps 1.65 times.
CHROMATOSOME: A nucleosome plus H1 histone.
Nucleosomes fold up to produce a 30-nm fiber that forms loops averaging 300 nm in height, which are compressed and folded to produce a 250-nm wide fiber. The tight coiling of the 250-nm fiber produces the chromatid of a chromosome.

We can all agree with these noble, hard-working scientists that the genome is very complex, and we may never grasp all of its complexity. Our knowledge about the genome keeps improving, but there are still many unanswered questions. We know that about 5-10% of the genome encodes genes. What is the function of the other 90%? So far there are no good answers.
In the 1990s, the non-coding regions were referred to as junk DNA, but the term has fallen out of use as our knowledge of the genome keeps improving: some of the so-called junk DNA contains elements that control gene transcription. Non-coding RNA, e.g. microRNA, can affect gene transcription depending on its location. A fairly balanced article on junk DNA in the post-ENCODE era, and the controversy that ensued, can be found in PLoS Genetics: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004351

THE DOUBLE HELIX

Deoxyribonucleic acid (DNA) has a double-stranded helical structure, and it encodes the genetic instructions used in the development and function of all known living organisms and many viruses. The two strands of DNA run in opposite directions to each other. Attached to each sugar of the sugar-phosphate backbone is one of four nucleobases. It is the sequence of these four nucleobases along the backbone that encodes the genetic code, or biological information. The four nucleobases are two purines (Adenine and Guanine) and two pyrimidines (Cytosine and Thymine). In the double helix structure, adenine bonds with thymine (A-T) and guanine bonds with cytosine (G-C). Under the genetic code, RNA strands are translated to specify the sequence of amino acids within proteins. The RNA strands are initially created using DNA strands as a template in a process called transcription.

Ribonucleic acid (RNA), unlike DNA, is single stranded and folds onto itself rather than forming a paired double strand. In RNA, the pyrimidine thymine is replaced by uracil. One of the universal functions of RNA is protein synthesis, where messenger RNA (mRNA) molecules direct the assembly of proteins on ribosomes. This process uses transfer RNA (tRNA) molecules to deliver amino acids to the ribosome, where ribosomal RNA (rRNA) links amino acids together to form proteins.

GENE

A gene was defined at least four decades before the DNA structure was discovered.
To a population geneticist, a gene is the basic unit of heredity, which comes in pairs, and one member of each pair is transmitted from parent to progeny. A more refined definition of a gene is a sequence (instruction manual) on a chromosome that encodes a protein or a polypeptide. A gene includes a 5' untranslated region (5' UTR), or leader sequence, which runs from the 5' end of the mRNA to the position of the first codon used in translation. The 3' UTR (trailer sequence) is the portion of an mRNA from the position of the last codon used in translation to the 3' end of the mRNA. The frame of a gene consists of exons and introns. An exon is any nucleotide sequence encoded by a gene that remains within the final mature RNA product of that gene. An intron is a noncoding part of a gene that is spliced out before the RNA is translated into a protein.

MOLECULAR MARKERS

What is the composition of the intergenic noncoding part of the genome?

Genome Studies
1. Improve annotation of the genome
2. Function and regulation of coding genes
3. Posttranslational regulation of genes
4. Extract potential functions from non-coding and intergenic DNA

For Animal and Poultry Breeding
1. Map quantitative trait loci
2. Identify genes associated with traits of economic importance
3. Estimation of genome breeding values
4. Genetic diversity
5. Gene flow
6. Population studies
7. Epidemiological studies
8. Domestication
9. Toxicity and many others

To date, a large proportion of genome studies have been possible because of genetic markers.

GENETIC MARKER: A DNA sequence that can be detected and whose inheritance can be monitored. The three properties that define a genetic marker are locus specificity, polymorphism and ease of genotyping. A marker is said to be polymorphic when it exists in more than one form.

Types of genetic markers
1. Restriction fragment length polymorphism (RFLP)
2. Variable number of tandem repeats (VNTR)
   a. Minisatellites
   b. DNA fingerprinting
   c. Microsatellites
3.
Sequence-tagged sites (STS) and expressed sequence tags (EST)
4. Random amplified polymorphic DNA (RAPD)
5. Amplified fragment length polymorphism (AFLP)
6. Single-strand conformation polymorphism (SSCP)
7. PCR amplification of specific alleles (PASA)
8. Copy number variation (CNV)
9. Single nucleotide polymorphism (SNP)
   a. Anonymous SNP: no known effect on gene function; these have been used extensively in gene mapping, linkage disequilibrium and diversity studies.
   b. cSNP: located within a protein coding sequence; may interfere with gene function by altering the amino acid sequence.
   c. Candidate SNP: a SNP thought to have a putative functional effect.
   d. rSNP: a SNP in the regulatory region of a gene; the regulatory region affects gene expression. For example, a mutation in the 5' UTR of the endoglin gene affects translational initiation and alters the reading frame in hereditary hemorrhagic telangiectasia (a vascular disorder).
   e. pSNP: when a phenotype is changed as a result of altered protein function, a cSNP or rSNP may be labelled a pSNP.
   f. Synonymous SNP: a base pair change occurs in a cSNP, but the cSNP still codes for the same amino acid.

There are several laboratory methods used to detect the aforementioned genetic markers. Those methods are not the subject of this course. The most commonly used markers in farm animal studies are microsatellites and SNPs.

SELECTION THEORY

Selection response (R) is how much gain is made by mating selected parents. Response to selection can be evaluated in the short or long term. The success of selection decisions depends on a number of factors:
1. How heritable is the trait under selection (i.e. the trait in the breeding goal)?
2. How much genetic variation for that trait is there in the population?
3. What is the average accuracy of the EBV, and thus the accuracy of selection?
4. What proportion of the animals will be selected for breeding?
5.
In case genetic gain is to be expressed per year, rather than per generation: how long is a generation? To optimize the success of a breeding program it is important to balance the relatively short-term decisions: acquire high genetic gain, and the long term maintenance of the population: controlling rate of inbreeding. SHORT-TERM RESPONSE: Predict a few generations of selection response when the base population (generation 0) additive genetic variance (heritability) is sufficient to make satisfactory prediction using the breeders’ equation (Lush, 1937) 𝑅 = ℎ2 𝑆 LONG-TERM RESPONSE: As selection proceeds, allele frequency changes and the base population genetic parameters fails to predict long term response. CHANGES IN THE MEAN: The within-generation mean: This reflects the changes in the entire population and that of the selected population. Selection can cause changes in the distribution of phenotypes. The withingenerational change is what is referred to as the Selection Differential, (S). The within-generational change is the means due to selection is: 𝑆 = 𝑋𝑠 − 𝑋0 Where 𝑋0 is the population mean (Generation 0) before selection and 𝑋𝑠 is the mean of the selected parents that produces the progeny population (Generation 1). 52 The between-generation mean: This is the response to selection, R which measures the changes in mean between the population before and after selection. 𝑅 = 𝑋1 − 𝑋0 Where 𝑋1 is the population mean (Generation 1) before selection. Weighted selection differential: The joint effects of natural and artificial selection affect selection response. Natural selection is always on the side of fitness and can be in the same direction or oppose artificial selection. Important assumption in evaluating predictions of genetic gain: environmental influences remain constant across generations Let’s examine the unweighted and weighted selection differential and ascertain how they are influenced by natural selection. Data from a long term selection program: 1. 
Calculate the Unweighted selection differential 2. Calculate the Weighted selection differential 3. Where the direction of national selection 53 Population mean Mating 1 2 3 4 5 6 7 8 9 10 Male (ram) 24 kg Female (ewe) 22 kg 22 35 23 20 24 30 30 37 22 19 20 29 22 24 20 27 30 22 20 20 # of offspring measured 2 1 1 2 2 2 0 0 6 10 N=26 Prediction of response to selection from the proportion selected: Selection intensity (i) The selection differential is limiting when comparing the strength of selection on different traits or in different populations. When planning a selection program, it would be rather useful to predict genetic change from certain selection strategy prior to even selecting the parental population to breed. This is possible when truncation selection (selection of individuals above or below a certain truncation point or threshold) is practiced. The selection differential can be derived from the distribution of predicted breeding values or phenotypic values and knowledge of the proportion of selected individuals. The standardized selection differential, usually called the selection intensity (i) is the selection differential expressed as a fraction of the phenotypic standard deviation. The selection intensity is a more useful measure for predicting selection response or comparing different selection strategies or response in different populations. 𝑖= 𝑆 𝜎𝑝 Where 𝜎𝑝 is the phenotypic standard deviation of the trait: This implies, 𝑆 = 𝑖𝜎𝑝 The breeders’ equation can therefore be written as: 𝑅 = 𝑖ℎ2 𝜎𝑝 The breeder’s equation theoretically holds for a single generation of selection from an unselected bas population. The reliability of using the breeder’s equation to predict response to selection beyond one generation depends on: 54 1. The accuracy of the heritability estimate 2. Absence of environmental changes between generations 3. 
Insignificant change in the heritability estimate from that of the base population From population genetics, we learned that heritability depends on allele frequency. Selection changes allele frequency. Therefore, it should be expected that, heritability will change with selection. Thus, in the strictest sense, the breeder’s equation is valid only for one generation. However, heritability is not expected to change significantly in the first few generations of selection and in practice, the breeders’ equation has been used to predict short term response (up to 3-5 generations of selection. Accuracy The breeders’ Equation can be extended beyond choosing an individual solely on the basis of its phenotype. 𝜎𝐴2 𝜎𝐴 ℎ 𝜎𝑝 = 2 𝜎𝑝 = ( ) 𝜎𝐴 = ℎ𝜎𝐴 𝜎𝑝 𝜎𝑝 2 We can rewrite the response to selection equation as: 𝑅 = 𝑖 ℎ𝜎𝐴 Where h is the correlation between the phenotypic and breeding values; ℎ = 𝑟𝐴𝑃 which quantifies the ability to predict the breeding value of an individual from the individual’s phenotype. This is in essence the accuracy of the selection scheme used to select parents. We can therefore express the breeders’ equation in terms of accuracy of selection as: 𝑅 = 𝑖 𝑟𝐴𝑃 𝜎𝐴 𝑅𝑒𝑠𝑝𝑜𝑛𝑠𝑒 = 𝑖𝑛𝑡𝑒𝑛𝑠𝑖𝑡𝑦 ∗ 𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 𝑜𝑓 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑛𝑔 𝐵𝑉 ∗ 𝑠𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝑑𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛 𝑜𝑓 𝐵𝑉 1. Single measurement on an animal The EBV of an animal can be estimated by regressing the animal’s BV on its phenotype. With a single measurement on an animal, the regression coefficient, 𝑏𝐴𝑃 equals the heritability ℎ2 : 𝑏𝐴𝑃 = 𝜎𝐴𝑃 𝜎𝐴2 = 2 = ℎ2 𝜎𝑝 𝜎𝑃2 55 The EBV, 𝐴̂ of an animal is 𝐴̂ = ℎ2 (𝑃 − 𝑃̅) and 𝐴𝑐𝑐 = √𝑏𝐴𝑃 𝑥 𝑔 = √ℎ2 𝑥 𝑔 Where P is the phenotypic value of the trait, 𝑃̅ is the population mean, and g the relationship between the individual(s) being measured and the individual for which we are estimating BV. The value of g is 1.0 for an individual's own performance. It is 0.5 for full sibs, progeny or parents and 0.25 for half sibs or grandparents. 
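The single-record formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the numbers in the usage section are hypothetical and are not taken from the notes. It also checks numerically that the two forms of the breeders' equation, $R = i h^2 \sigma_P$ and $R = i \, r_{AP} \, \sigma_A$, agree when $r_{AP} = h$ and $\sigma_A = h \sigma_P$.

```python
import math


def response_per_generation(intensity, accuracy, sigma_a):
    """Breeders' equation in accuracy form: R = i * r_AP * sigma_A."""
    return intensity * accuracy * sigma_a


def ebv_single_record(phenotype, population_mean, h2):
    """EBV from one record on the animal itself: A_hat = h^2 * (P - P_bar)."""
    return h2 * (phenotype - population_mean)


def accuracy_single_record(h2, g=1.0):
    """Accuracy from one record: sqrt(h^2 * g); g = 1.0 for own performance."""
    return math.sqrt(h2 * g)


if __name__ == "__main__":
    # Hypothetical illustrative values (assumptions, not data from the notes).
    h2 = 0.25
    sigma_p = 2.0
    sigma_a = math.sqrt(h2) * sigma_p  # sigma_A = h * sigma_P
    i = 1.4

    print("EBV:", ebv_single_record(130.0, 120.0, h2))
    print("Accuracy:", accuracy_single_record(h2))
    # Both forms of the breeders' equation should give the same response.
    print("R (accuracy form):", response_per_generation(i, accuracy_single_record(h2), sigma_a))
    print("R (heritability form):", i * h2 * sigma_p)
```

Keeping accuracy as a separate argument of `response_per_generation` is a design choice: the same function can then be reused when accuracy comes from a different information source than the animal's own record.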
Example 1: The daily feed consumption (FC) of two individuals A and B is 128 g and 135 g, respectively. The mean FC is 120 g, with a heritability of 0.20. Predict the EBV and accuracy of A and B for FC.

A: $\mathrm{EBV} = h^2 (P - \bar{P}) = 0.20 \times (128 - 120) = 1.6$ g; $Acc = \sqrt{h^2 \times g} = \sqrt{0.20 \times 1} = 0.45$

B: $\mathrm{EBV} = h^2 (P - \bar{P}) = 0.20 \times (135 - 120) = 3.0$ g; $Acc = \sqrt{h^2 \times g} = \sqrt{0.20 \times 1} = 0.45$

Individual B has a higher EBV for FC than A, but both estimates have the same accuracy.

2. Repeated measurement on an animal

Some traits can be measured several times during an animal's lifetime, for example feed consumption, body weight and egg production. If a trait is measured several times during an animal's life, each value should be used in the estimate of breeding value. The relationship between repeated records, termed "repeatability", becomes important. Repeatability ($r_e$) is a measure of the reliability or strength of the relationship between repeated measurements on an individual. When using repeated measurements on an individual, g is still 1.0, since the animal being measured and the animal for which the BV is obtained are the same. The value of $b_{AP}$ is now a function of the number of records (n), heritability ($h^2$) and repeatability ($r_e$). With repeated measurements on an animal:

$b_{AP} = \frac{n h^2}{1 + (n - 1) r_e}$ and $Acc = \sqrt{\frac{n h^2}{1 + (n - 1) r_e} \times g}$

Example 2: Assume that the daily feed intake of individual A (128 g) is an average of 5 measurements, with a repeatability of 0.40. Predict the EBV and accuracy of A.

$\mathrm{EBV} = \frac{n h^2}{1 + (n - 1) r_e} (P - \bar{P}) = \frac{5 \times 0.20}{1 + (5 - 1) \times 0.40} \times (128 - 120) = 3.08$ g

$Acc = \sqrt{\frac{n h^2}{1 + (n - 1) r_e} \times g} = \sqrt{\frac{5 \times 0.20}{1 + (5 - 1) \times 0.40} \times 1.0} = 0.62$

Repeated measurements on A increase both its EBV and the accuracy of that EBV for feed intake.

Accuracy of estimated breeding values for different heritabilities, repeatabilities and numbers of measurements on an animal.
                                 Number of measurements
Heritability   Repeatability       1       5      10
0.10           0.25              0.32    0.50    0.55
               0.50              0.32    0.41    0.43
               0.75              0.32    0.35    0.36
0.25           0.25              0.50    0.79    0.88
               0.50              0.50    0.65    0.67
               0.75              0.50    0.56    0.57
0.50           0.50              0.71    0.91    0.95
               0.75              0.71    0.79    0.80

Traits with low heritability benefit from multiple measurements since each additional record contributes toward the total information available, especially when the repeatability is low. If the repeatability is high, multiple measurements do not add much to the accuracy of the EBV.

3. Information from Relatives

In a closed population, there are bound to be full sibs (FS, with both parents in common) and half sibs (HS, with one parent in common) that provide additional information for estimating BV. Siblings have a proportion of their alleles (genes) in common. Full sibs have half of their alleles in common, and half sibs have a quarter of their alleles in common. In pigs, cattle, sheep and goats, siblings are initially reared together, and the common environment among siblings (maternal environment, temperature, food supply) creates additional similarity. In commercial poultry, however, similarity due to common environment is non-existent. In non-commercial poultry, where the hen incubates her own eggs and broods her chicks, similarity of siblings due to common environment is in play when estimating BV. The similarity among siblings, t, depends on the siblings involved.

$t_{HS} = \tfrac{1}{4} h^2 + c^2_{HS}$

$t_{FS} = \tfrac{1}{2} h^2 + c^2_{FS}$

where $c^2$ is the environmental correlation among sibs. The EBV and its accuracy are given as:

$\mathrm{EBV} = \frac{n g h^2}{1 + (n - 1) t} (P - \bar{P})$ and $Acc = \sqrt{\frac{n g h^2}{1 + (n - 1) t} \times g}$

where n is the number of siblings, t is the correlation among sibs, and g is the genetic relationship among sibs. For full sibs, g = ½, and for half sibs, g = ¼.

Example 3: Individual A has 5 half sibs with a mean FC of 128 g. Predict the EBV and accuracy of A when the environmental correlation $c^2$ is (a) 0, and (b) 0.125. The population mean for FC is 120 g and $h^2$ is 0.20.
Assume (c) that the 5 records were obtained from full sibs, and $c^2$ is 0.125.

(a) $t_{HS} = \tfrac{1}{4} \times 0.20 + 0 = 0.05$, and $g = 0.25$

$\mathrm{EBV} = \frac{n g h^2}{1 + (n - 1) t} (P - \bar{P}) = \frac{5 \times 0.25 \times 0.20}{1 + (5 - 1) \times 0.05} \times (128 - 120) = 1.67$

$Acc = \sqrt{\frac{n g h^2}{1 + (n - 1) t} \times g} = \sqrt{\frac{5 \times 0.25 \times 0.20}{1 + (5 - 1) \times 0.05} \times 0.25} = 0.23$

(b) $t_{HS} = \tfrac{1}{4} \times 0.20 + 0.125 = 0.175$, and $g = 0.25$

$\mathrm{EBV} = \frac{n g h^2}{1 + (n - 1) t} (P - \bar{P}) = \frac{5 \times 0.25 \times 0.20}{1 + (5 - 1) \times 0.175} \times (128 - 120) = 1.18$

$Acc = \sqrt{\frac{n g h^2}{1 + (n - 1) t} \times g} = \sqrt{\frac{5 \times 0.25 \times 0.20}{1 + (5 - 1) \times 0.175} \times 0.25} = 0.19$

When there is no measurement on the animal itself, the EBV predicted from relatives is low. The higher the value of t, the lower the EBV.

(c) $t_{FS} = \tfrac{1}{2} \times 0.20 + 0.125 = 0.225$, and $g = 0.50$

$\mathrm{EBV} = \frac{n g h^2}{1 + (n - 1) t} (P - \bar{P}) = \frac{5 \times 0.50 \times 0.20}{1 + (5 - 1) \times 0.225} \times (128 - 120) = 2.11$

$Acc = \sqrt{\frac{n g h^2}{1 + (n - 1) t} \times g} = \sqrt{\frac{5 \times 0.50 \times 0.20}{1 + (5 - 1) \times 0.225} \times 0.50} = 0.36$

Sib information never results in really high accuracy. Full sib information is limited by environmental correlations among the sibs. It should not replace the individual's own record if that can be obtained. Rather, it should be used to supplement the information on the individual if sib information happens to be available.

Progeny testing

Using the mean of a parent's progeny to predict the parent's breeding value is an alternative predictor of an individual's breeding value. The correlation between the mean of n progeny and the breeding value of the parent is

$r_{AP} = \sqrt{\frac{n}{n + a}}$, where $a = \frac{4 - h^2}{h^2}$

or, equivalently,

$r_{AP} = \sqrt{\frac{n h^2}{4 + h^2 (n - 1)}}$

Example: A breeder selects the top 20% of sheep based on the performance of 10 offspring. The heritability of udder size is 0.10, with a phenotypic variance of 50. Predict the response to selection that the breeder will achieve with this strategy. A selected proportion of 20% results in a selection intensity of 1.4.

$r_{AP} = \sqrt{\frac{10 \times 0.10}{4 + 0.10 \times (10 - 1)}} \approx 0.45$

The breeder is disappointed and wants more genetic gain. Predict how much improvement can be achieved by selecting the top 10% instead of the top 20% for breeding.
What changed? The breeder is still not completely satisfied because he wants more genetic gain, and decides to base the selection on the performance of 15 instead of 10 offspring. Predict the selection response for this new situation. What changed?

From Response per generation to Response per year

The breeders' equation thus far calculates response to selection per generation. However, to calculate the selection response per year, the generation interval is required.

In quantitative genetics, generation intervals are generally defined as the average age of parents at the birth of their offspring. In this definition, the generation interval is based on the contributions of parental age classes to newborn offspring; i.e., the average age of parents is calculated as the sum of ages at birth of offspring weighted by the contribution of each age class to newborn offspring. This approach is adopted in the well-known gene flow procedure (Hill 1974).

The breeders' equation per year can be calculated as:

$R_{yr} = \frac{i \, r_{AP} \, \sigma_A}{L}$

The generation interval L can be calculated separately for males and females and averaged.

Equal numbers of 2 and 3 year old bulls selected as parents: $L_{males} = 2.5$ years
Equal numbers of 2, 3 and 4 year old cows selected as parents: $L_{females} = 3.0$ years; $L_{average} = 2.75$ years

Age structure of animals selected for breeding

Age         2     3     4     5   TOTAL
Male       10     7     3     -      20
Female    200   175   100    25     500

$L_{male} = \frac{(10 \times 2) + (7 \times 3) + (3 \times 4)}{10 + 7 + 3} = 2.65$ yr

$L_{female} = \frac{(200 \times 2) + (175 \times 3) + (100 \times 4) + (25 \times 5)}{200 + 175 + 100 + 25} = 2.90$ yr

$L_{average} = \frac{2.65 + 2.90}{2} = 2.775$ yr

High selection intensity means high generation interval, and low selection intensity means low generation interval. This does not fit well with maximizing i/L. i/L should be OPTIMIZED. Optimizing genetic gain will require a balance between increasing the accuracy and increasing the generation interval.

Selection Path

The selection strategies for males and females are different. The major differences between the sexes are:
1. In mammals there is a limited reproduction capacity in females. We assume that population size is the same across generations. We should be aware that selected animals should be capable of producing sufficient progeny to maintain population size. Males generally can produce more progeny than females and, as a result, selection intensity is higher in males than in females. We should also be mindful of the direction of natural selection to ensure that sufficient progeny are produced.
2. The information sources for estimating breeding values in males and females may be different. Males may be selected based on progeny performance, whereas females are selected on their own performance, leading to differences in accuracy of selection.
3. The generation interval for the sexes may also be different. If males are selected based on progeny testing, then on average the age at which males are used for breeding will be different from that of females.

The aforementioned differences between males and females require different selection paths when determining response to selection per year. The breeders' equation can be written as:

$R_{yr} = \frac{R_m + R_f}{L_m + L_f} = \frac{i_m \, r_{AP,m} \, \sigma_A + i_f \, r_{AP,f} \, \sigma_A}{L_m + L_f}$

The intensity of selection, the accuracy of selection and the generation interval may be different in males and females. The genetic standard deviation, however, is a population parameter and is, therefore, the same for males and females.

A sheep breeder has a 200-ewe flock and is selecting for weaning weight. Rams are first selected at 2 years old and mated for 3 years. Ewes are first selected at 2 years old and mated for 5 years. Each ram is mated to 20 ewes, with an 80% lambing rate and a 50:50 sex ratio, and there is no significant mortality in adults. The heritability is 0.11 and the phenotypic variance is 0.25 kg². Calculate the response to selection per year.
Age structure of animals selected for breeding (ages 2 to 6)
Male: 5 in each of two age classes, TOTAL 10
Female: 40 in each of the five age classes (2, 3, 4, 5 and 6), TOTAL 200

200 ewes with an 80% lambing rate means 160 lambs in total (80 of each sex). Select 5 out of 80 males each year. The proportion is 5/80 = 6.25%, corresponding to a selection intensity, i, of ~1.98. Select 40 out of 80 females each year. The proportion is 40/80 = 50%, corresponding to a selection intensity, i, of 0.798. Calculate the response to selection per year.

We can define four selection paths:

Sires to breed sires (SS): This is the most stringent selection path, used to breed the new fathers of sires. Only elite sires make it to sire father.

Sires to breed dams (SD): Within the sires this is a less stringent selection path. These sires will be the fathers of the breeding females (the dams).

Dams to breed sires (DS): This is the most stringent selection path within the dams, used to breed new sires. Only the elite dams will make it to sire mother.

Dams to breed dams (DD): This is the least stringent selection path. It depends on the studbook whether there are selection criteria for new dams.

$R_{yr} = \frac{R_{SS} + R_{SD} + R_{DS} + R_{DD}}{L_{SS} + L_{SD} + L_{DS} + L_{DD}}$

Selection response can be divided into a number of selection paths, the number depending on the number of differences in selection intensity and accuracy of selection.

LIVESTOCK BREEDING STRATEGIES

Samuel E Aggrey, PhD
University of Georgia
Athens, GA 30602, USA

Several panels have been assembled in the past by governments, international agencies and nonprofit organizations to map out strategies to improve livestock productivity in developing countries. The goals have been laudable but the outcomes have fallen far short of them. Breeding strategy in the developing world has become synonymous with turning the axle of poultry and livestock production to mirror that of advanced countries. In the developing world, genetic improvement has come to imply upgrading a herd, usually that of a national livestock research institute.
Several crossbreeding projects were initiated all across Africa with the goal of quickly upgrading low producing indigenous and adapted breeds with high producing exotic breeds from Europe or North America. Management of crossbred herds did not match their genetic potential and as a result the expected productivity was not realized. The crossbreeding approach to genetic improvement was not done in a sustainable manner and currently only remnants of such projects exist. It should be pointed out that in a few cases, crossbreeding on private farms with improved nutrition and management has been successful, but these cases are not enough to meet the massive demand for meat and livestock products.

Genetic improvement is a long term endeavor and short term approaches are bound to yield limited or no success at all. Funding for genetic improvement projects from most international agencies only lasts for about 5 years. Funding from national governments could be as short as one year. A total mismatch of a long term endeavor with very short term funding can only point in the direction of limited success, if not failure.

In recent times, scientific jargon has been embraced in several projects. Biotechnology is the silver bullet expected to radically transform the whole agricultural sector in the developing world. The argument here is not about the potential of biotechnology. When a high powered fuel is put into a non-functioning engine, the vehicle will still not move. All other parts of the vehicle should also be functioning. Genomics, high throughput science, biotechnology and nanotechnology, when applied in the proper environment, can lead to tremendous increases in productivity. However, I would argue that, before any of these advanced technologies are adopted en masse, the well proven methodologies need to be adopted first.

In the developing world, breeding strategies need to have at least four basic components:
1. Assessment
2. Preplanning
3. Technical mechanics of genetic improvement
4. Sustainability

A. ASSESSMENT OF EXISTING SYSTEM

Assessment can be done in five broad areas to answer basic questions and determine whether genetic improvement is even needed at all.

1. Current Production System
   a. Who are the breeders?
   b. Who are the animal keepers?
   c. What are the management practices?
   d. Can the current production system support an improvement program?
   e. Is reduction in herd size or animal numbers possible?
   f. What are the logistics and infrastructure?
   g. What is the environmental impact?
   h. Is the current production system sustainable?
2. Existing Input and Support
   a. Water
   b. Labor
   c. Animal health care
   d. Extension
   e. Training support
   f. Research support
3. Cultural and Social Practices
   a. What is the cultural/societal value of animals?
   b. What is the significance of raising and/or keeping animals?
4. Current Breeding Practices
   a. How do genes flow from breeding to producing animals?
      i. How do farmers obtain replacement animals?
      ii. Pure or crossbred? Or no form of improvement?
5. Market Analysis
   a. What is the size of the overall market?
   b. Can the market improve or grow?
   c. Is there demand for the product?
   d. What is the purchasing power of the population?
   e. Are there export possibilities?
   f. Can the market accommodate improvement in the production system?

There should be a fact based justification for genetic improvement. When there is a demand for a product, there is no need to convince producers to produce more.

GENETIC IMPROVEMENT IS A LONG TERM PROGRAM

What we learned from past attempted programs
1. Short term funding (≤5 years) has been a colossal FAILURE.
2. An economically sustainable plan for the long term is required.
3. A genetic diversity plan (biodiversity) should be required for the long term.
Otherwise, do not start!

B. PREPLANNING

In the preplanning stage, both livestock keepers and consumers should be adequately involved in the early planning of genetic improvement programs.
Some questions also need to be adequately answered at this stage.
1. Is there a demand for increased productivity?
2. Are improved animals needed by livestock keepers without exceeding their capacity to manage the animals?
3. Will an increased supply of external inputs (diet, vaccines, housing, etc.) increase productivity, rather than a new breed?
4. Will consumers accept a new breed, improved strain or crossbred?

In most cases in Africa, livestock keepers have their own breeding criteria and any genetic improvement program should take that into account when defining the breeding objective. For example, the Karamoja pastoralist prefers coat color, body size, conformation, horn configuration and temperament as traits suitable for marketing. In Ethiopia, there is a preferred phenotypic characteristic of chickens. After all, the breeding objective should be based on projected profits under future conditions of production and not merely on the potential to change a trait genetically. The definition of profit may differ from place to place. Whereas some places use monetary value to define profit, others may simply use herd size.

It is during the preplanning stage that priorities and the sustainability plan for the entire breeding strategy should be developed.

PRIORITIES
a. Short term
b. Medium term
c. Long term

1. Can the objectives of the priorities be achieved in the given time?
2. Is there any funding in place, or expected in the future, for any of the priority steps?
3. Are outcome benchmarks clearly defined?
4. Can the outcomes be achieved?

C. TECHNICAL MECHANICS OF GENETIC IMPROVEMENT

BREEDING OBJECTIVES

The breeding objective is defined based on projected profits under future conditions of production, not merely on the potential to change traits genetically.

Breeding is always aimed at the future. Decisions you make now will influence the future generation(s). The breeding goal that you have defined indicates what you think will be important in the future.
You have analyzed the market and have an idea about what customers will demand some years from now. Will it be mainly milk or butter or cheese? Will it be mainly pork chops or ham or bacon? Will it be mainly breast meat or legs or full carcasses? Finally, you have an idea about the expected developments in production systems and regulations. What are the new developments related to housing systems, nutrition, etc., and how are they expected to influence the performance of your animals? Has the (inter)national government announced new regulations that may limit your current production system? Should you anticipate these upcoming changes?

This means that the best animals for the future conditions of production need to be developed. How does one define the "best animal"? The definition of the best animal is subjective, depending on (1) the function of the animal, (2) culture, (3) market structure, (4) production environment, (5) legislation, (6) population structure [pyramidal or segmented] and (7) environmental limitations. Cattle are kept for meat, milk and draft. Depending on the function of the animal within that particular society, the best animal can be defined. A high milking cow may be suitable for Wisconsin, but in the hills of Ethiopia, a hardy cow may be suitable. The best animal should function well within the production and climatic environment and be culturally acceptable.

Broiler (meat-type) chicken processing changes in the USA

                     1980                   1990
                     Percentage processed   Percentage processed
Whole birds          67%                    23%
Cut-ups              33%                    67%
Further processed    -                      10%

The type of bird for cut-ups and further processing is different from that raised for sale as whole birds. This means breeders must anticipate future markets and develop birds to meet those demands. Such a bird will also be the best animal for the future.
The best animal may not necessarily be a high performance animal for a particular animal product (milk, meat or fiber), but could be an average performance animal with reasonable resistance to an endemic disease. Defining the best animal is not easy and requires input from animal keepers, consumers, breeders and other stakeholders.

Matching genotypes with suitable environments and societal acceptability depends on the availability of a wide range of genotypes to choose from. A thorough knowledge of similar genotypes in other tropical regions, including nutrition and local diseases, is needed. The phenotypes may be acceptable but may not necessarily cope in a new environment. The following may be considered in selecting the best animal:
1. Genetically improving locally adaptable indigenous animals.
2. Introducing breeds/strains from similar environment(s).
3. Crossbreeding of locally adaptable animals with high producing animals from similar environment(s).
4. Crossbreeding with exotic breed(s), with a clear pathway for a reliable supply of exotics.
5. Developing a synthetic breed.

India has been successful in developing several local poultry strains, most of which are strains of choice in commercial poultry production. The Australian Brangus cattle are about 3/8 Brahman and 5/8 Angus in their genetic makeup. The cattle are usually sleek black in color, but reds are also acceptable. Australian Brangus are also good walkers and foragers and "do well" in a wide variety of situations. South Africa has successfully developed both cattle and poultry breeds.

Data Recording System

Any serious genetic improvement program should have the infrastructure for collecting data. Without data collection it is almost impossible to undertake any form of tractable genetic improvement. Large cattle herds are kept by pastoralists in Nigeria and Eastern Africa. There are several households that own small numbers of animals.
Involvement of animal keepers in a genetic improvement program offers the opportunity to collect data on their animals. A data repository center with high storage and computing capacity is absolutely essential in developing any improvement program. In the USA, the US Department of Agriculture is responsible for storage and analysis of dairy cattle data. Beef cattle data are handled by the various breed associations and some large cattle ranches. Swine and poultry data are handled by their respective private breeding companies. A data repository agency needs to be identified in each African country and its role clearly defined.

In recent times, the prospects of biotechnology and genomic selection have been projected as the "savior" for genetic improvement in the developing world. Regardless of the potential of genomic selection, phenotypic data and pedigree information have to be collected. While it is possible to realize genetic gain with well-defined phenotypes without genomic information, it is NOT possible to realize gains without well-defined phenotypes even with genomic information (Henryon et al. 2014).

When the infrastructure for the well proven methods of genetic improvement is in place, advanced technologies become easy to adopt. Several novel approaches can be devised for data collection. Models can be developed to predict unmeasured phenotypes from the measurement of a few easy-to-measure phenotypes.

Figure 1 The livestock breeding and improvement cycle

GENETIC IMPROVEMENT PLAN

1. ANIMAL POPULATION AND POPULATION STRUCTURE

A breeding scheme defines the breeding objectives for the production of the next generation of animals. An animal breeding scheme is a combination of recording selected traits, the estimation of breeding values, the selection of potential parents and a mating program for the selected parents, including appropriate (artificial) reproduction methods. The breeding scheme will also depend on the population structure.
(a) Breeding programs with separate breeding and production populations

Separation of breeding and production populations allows the breeder to focus on the objectives of each population. The purpose of the breeding population is genetic improvement in traits of interest. The production population is the vehicle through which commercial production is enhanced. Genetic material from the breeding population should constantly influence the production population. Most commercial dairy farmers in developed countries and some parts of Africa purchase semen from improved bulls to constantly upgrade their herds. A breeding program in Africa can concentrate on developing males and then sell them to local producers to improve their flocks in exchange for data collection. There are several advantages to doing so in addition to data collection. This automatically includes the animal keeper in the breeding scheme. Nobody kills the golden goose. When the farmer sees the benefits of improved animals without the burden of keeping males, such a scheme is bound to be successful. Over time, this strategy can become part of the sustainability plan. When the farmer links the receipt of genetic material to profits, it becomes easy for the farmer to pay for such genetic material. That is when the breeding strategy becomes sustainable.

Figure 2 The components of a sustainable animal breeding scheme

Components of the above structure can be adopted for sustainable genetic improvement in the developing world for cattle and small ruminants and even pigs.

(b) Breeding programs with a pyramidal structure

This structure is often seen in species where trait recording is extensive and also very expensive. Under this structure only a small number of individuals, relative to the production population, are recorded. Genetic improvement is done in a limited number of animals and these animals become the source of gene flow to the production population.
Genetic improvement in small elite pure lines, multiplication in the next generation with a much larger number of animals (parents), and generation of the production animals in very large numbers in the final generation together lead to a pyramidal structure for such a breeding and production program. This is a strategy usually employed by poultry and pig operations in developed countries. Whereas some companies house and develop only elite pure lines, others develop an integrated system from pure lines to the commercial animal.

Figure 3 The classic pyramidal structure of livestock genetic improvement

Under the pyramid structure, concerns from consumers, lobby groups and food services at the bottom of the pyramid bubble up to the pure lines. Over time, these concerns are addressed in the genetic improvement programs in the pure lines. Poultry breeding companies develop animals for different markets and can respond to market changes more quickly than cattle breeders, especially since the generation interval is far shorter in poultry than in cattle.

In a pyramid structure all sources of genetic variation are exploited. Selection response is realized in the elite pure lines. There, the additive genetic variance, the accuracy of breeding value estimation and the selection intensity become important, as these three factors determine genetic gain. The grandparent and parent multiplication levels exploit heterosis via non-additive genetic variance. In commercial pig breeding programs, and in some rare cases of poultry breeding, a three-way cross is usually applied. The next figure illustrates a commercial three-way cross. Usually, the terminal male is a purebred selected on growth, feed efficiency and other production characteristics. The final female is usually a hybrid, taking advantage of both production and reproduction traits.

Figure 4 Three-way commercial crossbreeding scheme

2. SELECTION OR IMPROVEMENT STRATEGY

This stage includes breeding value estimation, selection criteria and genetic models. After estimating breeding values and evaluating the effect of alternative selection decisions on the genetic response to selection, the actual practical selection and mating of animals can begin. Selection programs can maximize genetic gain at a constrained rate of inbreeding, e.g. ≤1%, or at any level that will limit the accumulation of inbreeding. It is at this stage that factors such as selection intensity and generation interval are optimized. Several options can be pursued, including:
(a) Mass selection
(b) Optimum contribution selection (OCS): maximizing long-term gain by maximizing the weighted genetic merit of the selected parents while constraining the relationship between parents
(c) Index selection
(d) Single- or multi-trait selection
(e) Selection on correlated traits

Selection determines which animals become the parents of the next generation. However, a mating plan needs to be in place to ensure that diversity is maintained and that inbreeding does not accumulate too rapidly.

Mating Strategy
1. Enables selection to align ancestors more closely to the exact threshold linear relationship.
2. Reduces the rate of inbreeding and the risk of alleles being lost through genetic drift.
3. Reduces variation in the accuracy of breeding values between selected candidates by increasing connectivity.
4. Genomic information can enable us to develop mating designs that disperse genetic contributions more efficiently than pedigree information allows, by:
a. Minimizing co-ancestry between mates.
b. Minimizing the covariance between ancestral contributions.
c. Maximizing the probability that all ancestors contribute chromosomal segments to all allocated matings.

EVALUATION OF IMPROVEMENT STRATEGY

The traits in the breeding objective are not necessarily the traits under selection. It is therefore important that both the breeding objective traits and the selected traits are evaluated each year.
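A yearly evaluation needs an expected response against which the realized response can be judged. The following is a minimal sketch, assuming the standard textbook form of the expected annual gain, ΔG = i × r × σA / L, which combines the selection intensity (i), the accuracy of breeding value estimation (r), the additive genetic standard deviation (σA) and the generation interval (L). The function names and all numerical values are hypothetical and serve only as an illustration.

```python
def expected_annual_gain(intensity, accuracy, sigma_a, generation_interval):
    """Expected genetic gain per year: i * r * sigma_A / L (textbook form, assumed here)."""
    if generation_interval <= 0:
        raise ValueError("generation interval must be positive")
    return intensity * accuracy * sigma_a / generation_interval


def realized_to_expected(realized_gain, expected_gain):
    """Realized response expressed as a fraction of the expected response."""
    if expected_gain == 0:
        raise ValueError("expected gain must be non-zero")
    return realized_gain / expected_gain


# Hypothetical values, for illustration only
expected = expected_annual_gain(
    intensity=1.5,            # standardized selection differential
    accuracy=0.5,             # correlation between estimated and true breeding value
    sigma_a=40.0,             # additive genetic standard deviation, in trait units
    generation_interval=2.0,  # years
)
ratio = realized_to_expected(realized_gain=12.0, expected_gain=expected)
print(f"expected gain per year = {expected:.1f} trait units")
print(f"realized / expected = {ratio:.2f}")
```

A ratio well below 1 would be a prompt to look for causes of the discrepancy rather than a diagnosis in itself.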
The following evaluation criteria can be considered:
1. Selection response in selected traits.
2. Selection response in breeding objective traits.
3. Annual rate of inbreeding and inbreeding depression.
4. Annual cost of the breeding program, including appreciation/depreciation of fixed costs.

The annual rate of inbreeding can be used as an indirect measure of diversity in the elite populations. It is important to compare the theoretically expected response with the realized response. The actual weighted selection intensity can be used to evaluate the theoretical response. If there is a discrepancy, its causes need to be ascertained. Potential sources of discrepancy may be:
(a) Bias in the estimation of breeding values.
(b) Inappropriate genetic model.
(c) Some environmental factors not considered or accounted for.
(d) Selection criteria not strictly adhered to.
(e) Unexpected correlated response in other traits.

DISSEMINATION OF GENETIC MATERIAL TO PRODUCTION POPULATIONS

From here on, the alleles (genes) of the improved population are disseminated to the production population in a way that depends on the population structure. In most cases, some form of crossbreeding is pursued to take advantage of heterosis, or hybrid vigor. Heterosis is the change in performance of crossbred animals relative to that of the purebreds.

ECONOMIC AND GENETIC SUSTAINABILITY OF BREEDING PROGRAM

A breeding program is the organized structure set up to realize the desired gain in the production population. It is important for producers to also have a sense of improvement in their populations. Producers can only judge the benefit of a breeding program when the productivity of their animals improves and their “profit” margins go up. Farmers find it easy to pay for genetic material when they can link their profit margins directly to the genetic material they received.
Economic sustainability can be achieved only when producers of improved animals can recover their costs and make a profit from the recipients of their improved animals. Pertinent questions to ask at this point are:
1. Can breeding programs sponsored for up to five years be economically sustainable?
2. Is the breeding program also genetically sustainable?

Genetic variation is the raw material for genetic improvement. When a genetic improvement strategy leads to genetic gain in traits, there is an accompanying loss of genetic variation. The inbreeding level and genetic diversity in the indigenous populations being improved for production also need to be constantly monitored to ensure that genetic variation between breeds (biodiversity) is preserved for the future.
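Monitoring the inbreeding level can start from the rate of inbreeding expected from the numbers of breeding males and females. The following is a minimal sketch, assuming Wright's formula for the effective population size with unequal sex numbers, Ne = 4 Nm Nf / (Nm + Nf), and an expected rate of inbreeding per generation of ΔF = 1 / (2 Ne). Both assume an idealized random mating population; under selection the actual rate is typically higher, so the result is better read as a lower bound. The function names and the example numbers are hypothetical.

```python
def effective_population_size(n_males, n_females):
    """Wright's effective size for unequal numbers of breeding males and females."""
    if n_males <= 0 or n_females <= 0:
        raise ValueError("both sexes must be represented")
    return 4.0 * n_males * n_females / (n_males + n_females)


def inbreeding_rate(n_e):
    """Expected rate of inbreeding per generation: delta_F = 1 / (2 * Ne)."""
    if n_e <= 0:
        raise ValueError("effective population size must be positive")
    return 1.0 / (2.0 * n_e)


def inbreeding_after(delta_f, generations):
    """Expected inbreeding coefficient after t generations, starting from F = 0."""
    return 1.0 - (1.0 - delta_f) ** generations


# Hypothetical nucleus: 10 breeding males and 40 breeding females
n_e = effective_population_size(10, 40)   # 32.0
delta_f = inbreeding_rate(n_e)            # 0.015625, above a 1% target
print(f"Ne = {n_e:.1f}, delta_F = {delta_f:.4f} per generation")
print(f"F after 5 generations = {inbreeding_after(delta_f, 5):.4f}")
```

In this illustration the limiting factor is the small number of males; increasing the number of sires used per generation raises Ne faster than adding more females.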