AEC 550 Conservation Genetics
Lecture #2 – Probability, Random Mating, HW Expectations, & Genetic Diversity

Today:
Review probability in population genetics
Review basic statistics
Population definition
Random mating and non-overlapping generations models
Hardy-Weinberg model
Look at measures of genetic diversity, following Tuesday's talk

Note: at times a question in these notes is left blank. Make sure you can answer it after lecture; these are often concepts that are important for a deeper understanding and for your mid-term.

Probability Theory in Population Genetics

The PROBABILITY (P) of an event is the number of times the event will occur (a) divided by the total number of possible events (n).
P = a/n

Multiplicative (Product) Rule: If the events A and B are independent, then the probability that they both occur is
P(A and B) = P(A) x P(B)
That is, the probability of 2 or more independent events occurring simultaneously is equal to the product of their individual probabilities. For example, the probability of a progeny having the genotype AA at a locus is the frequency of the A allele in the population (denoted as p) x the frequency of the A allele in the population, or $p^2$.

Sum Rule: The probability that one or another of 2 or more mutually exclusive events occurs is equal to the sum of their individual probabilities:
P(A or B) = P(A) + P(B)
Using the example above, the frequency of a heterozygote genotype Aa at a locus starts from the product of the two allele frequencies, pq. However, there are two mutually exclusive ways to get a heterozygote: an A (frequency p) from the mom and an a (frequency q) from the dad, or an a from the mom and an A from the dad. We can write this as pq + qp = 2pq.

Conditional probability – the probability of one event given that the other event has occurred.
P(A|B) = P(A and B) / P(B)
If A and B are independent, P(A and B) = P(A) x P(B), so P(A|B) = P(A) x P(B) / P(B) = P(A).

BASIC STATISTICS:

Basic Terms:
Population = group of things we are interested in (population of inference)
Sample = subset of the population; typically it is not possible to sample the total population
Random sample = each member has an equal and independent chance of being in the sample
Variable = an attribute common to all members of the population that varies in its realization; these realizations are called variates
Random variable = a variable measured on the random sample
Continuous variable = metric variable, continuous scale, e.g., height
Discrete variable = meristic variable, countable, e.g., # of leaves, # of digits, integers
Categorical variable = grouped and discrete but not ordered
Example: Categorical: genotypes AA, Aa, aa. Discrete: number of A alleles.
Parameter = numerical summary or constant that measures the population of inference; it describes the entire population
Example: $\sigma^2$ is the population variance and $\mu$ is the population mean for a certain trait x
Statistic = value of this numerical constant calculated on the sample and used to estimate the parameter
Example: $s^2$ is the sample variance and $\bar{x}$ is the sample mean

Summary statistics allow us to compare populations and estimate the parameters.

Statistics are divided into 5 categories:
Descriptive
Tests of difference
Tests of relationship
Multivariate exploratory methods
Estimators of population parameters

Central Tendency: Arithmetic Mean
Sample mean: $\bar{x} = \sum_{i=1}^{n} x_i / n$
Population mean: $\mu = \sum_{i=1}^{N} X_i / N$

Calculate the average fitness of a population:
From your sample of the population, categorize individuals into groups:

#     Genotype   Fitness
25    AA         0.7
50    Aa         0.5
25    aa         0.4

Sum over categories of (freq. of category)(value of category):
(0.25)(0.7) + (0.5)(0.5) + (0.25)(0.4) = average fitness

The measure of variability or dispersion of points around the mean is the variance.
Population variance: $\sigma^2 = \sum (X - \mu)^2 / N$
Sample variance: $s^2 = \sum (x - \bar{x})^2 / (n - 1)$

Standard deviation (SD) is the square root of $s^2$. Remember that, for a normal distribution, about 68% of the values fall within 1 SD of the mean and about 95% fall within 2 SD.
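The average-fitness and variance calculations above can be sketched in Python as follows. This is only an illustration: the function names `weighted_mean` and `sample_variance` are made up for this sketch, and the step that expands the grouped table into one value per individual is an assumption about how you might apply the $s^2$ formula to grouped data.

```python
from math import sqrt


def weighted_mean(counts, values):
    """Mean of `values`, each weighted by its category frequency."""
    total = sum(counts)
    return sum((c / total) * v for c, v in zip(counts, values))


def sample_variance(xs):
    """Sample variance s^2, using the n - 1 denominator."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


# Genotype counts and fitness values from the table in the notes.
counts = [25, 50, 25]        # AA, Aa, aa
fitness = [0.7, 0.5, 0.4]

mean_fitness = weighted_mean(counts, fitness)

# Expand the grouped data into one fitness value per individual
# (an assumption of this sketch) so s^2 can be computed directly.
individuals = [w for c, w in zip(counts, fitness) for _ in range(c)]
s2 = sample_variance(individuals)
sd = sqrt(s2)

print(f"mean fitness = {mean_fitness:.3f}")
print(f"s^2 = {s2:.4f}, SD = {sd:.4f}")
```

Swapping the `n - 1` denominator for `n` in `sample_variance` would give the population form $\sigma^2$ instead.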
Do not confuse SE with SD. SD measures the dispersion of the underlying raw data, and SE measures the dispersion of a sample statistic (the SD of its sampling distribution). For example: SE describes the spread of the sample mean heterozygosity across repeated samples, while SD describes the spread of the individual heterozygosity values.

Geometric mean – the nth root of the product of n numbers, used in growth rate estimates
Harmonic mean – the reciprocal of the mean of the reciprocals; it is weighted toward the smallest values and is used in calculating the effective population size

POPULATIONS:
Group of organisms (species) living within a sufficiently restricted geographic area with random mating
Local interbreeding population
Local populations or demes (Mendelian populations or subpopulations)

THE MODEL OF RANDOM MATING:
[Diagram: parent population genotype frequencies P(AA), P(Aa), P(aa) -> allele pool of A and a gametes -> new population genotype frequencies P'(AA), P'(Aa), P'(aa)]

NON-OVERLAPPING GENERATIONS
Mostly insects and plants. While simple, the model works for a lot of organisms with complex life histories:
[Diagram: generation t-1 -> generation t -> generation t+1]

HARDY-WEINBERG MODEL
GH Hardy & W Weinberg 1908 (independently)
WE Castle (1903, Harvard geneticist)

Assumptions of the HW Principle
1. Diploid population (2N)
2. Sexual reproduction – no selfing
3. Non-overlapping generations
4. Locus with 2 alleles
5. Allele frequencies are equal in males and females
6. Random mating
7. Infinite population size
8. Mutation ignored
9. Natural selection doesn't affect the alleles considered

Model with Theoretical Predictions
[Diagram: Gen 1 -> Gen 2 along a time axis]
p = frequency of A allele
q = frequency of a allele
p + q = 1
Independent trials: (pA + qa) x (pA + qa) = 1 (all genotypes)
So $p^2 + 2pq + q^2 = 1$
(1) Equilibrium allele frequencies: after one round of random mating, p and $p^2$ are equal to p' and $(p^2)'$, that is, allele and genotype frequencies stop changing.
(2) What about random union of gametes?

EXAMPLE:
If we have a single locus with two alleles, A1 and A2
Let:
p = frequency of A1 allele
q = frequency of A2 allele
What are the three possible genotypes?
The allele frequencies can be estimated from the genotype frequencies. Writing P, Q, and R for the frequencies of the genotypes A1A1, A1A2, and A2A2:
$p = P + Q/2$
$q = R + Q/2$

Now if there is random mating, what is the frequency of genotypes in the next generation? What are the progeny genotypes given the adult genotypes and random mating?

                              Frequency of zygotes (progeny)
Mating          Frequency     A1A1    A1A2    A2A2
A1A1 x A1A1     P^2           1       0       0
A1A1 x A1A2     2PQ           1/2     1/2     0
A1A1 x A2A2     2PR           0       1       0
A1A2 x A1A2     Q^2           1/4     1/2     1/4
A1A2 x A2A2     2QR           0       1/2     1/2
A2A2 x A2A2     R^2           0       0       1
New genotypes                 P'      Q'      R'

P' + Q' + R' = 1

$P' = P^2 + \frac{2PQ}{2} + \frac{Q^2}{4} = \dots = p^2$
$Q' = \frac{2PQ}{2} + 2PR + \frac{Q^2}{2} + \frac{2QR}{2} = \dots = 2pq$
$R' = R^2 + \frac{2QR}{2} + \frac{Q^2}{4} = \dots = q^2$

For extra credit on your homework this week, can you prove the connection of the equation for P' to $p^2$, Q' to $2pq$, and R' to $q^2$?

EXAMPLE

Measures of Genetic Diversity - Allozyme Data

There are two standard measures of allozyme diversity.

(1) P, the proportion of sampled loci that are polymorphic
P = x/m
x is the number of polymorphic loci in a sample of m loci
Note: Often you'll see this used as a measure of diversity for allozyme loci. Because it depends on sampling (with low sample numbers, a locus may appear monomorphic but turn out to be polymorphic once more individuals are sampled, see below), it is not a good measure for highly polymorphic loci.

(2) H, mean heterozygosity
Sample a locus with two alleles at frequencies of 0.4 and 0.6
Let $p_1 = 0.4$ and $p_2 = 0.6$
Homozygotes: $p_1^2 = 0.16$; $p_2^2 = 0.36$
Therefore 1 - (0.16 + 0.36) = 0.48 (48% heterozygotes)
Average over all loci, including monomorphic ones!
General equation (Nei 1987): $H = 1 - \sum_{i=1}^{n} p_i^2$
Unbiased estimate: $\hat{H} = \frac{2N}{2N-1}\left(1 - \sum_{i=1}^{n} p_i^2\right)$, where N is the number of individuals sampled

Note: The general equation for expected heterozygosity is often referred to as a measure of diversity. We use this equation for more than just allozymes, and it is fundamental to understand for measuring divergence among populations (F-statistics). I like to think of the measure as the probability of an individual being heterozygous at a given locus.
Many human microsatellite loci have an expected heterozygosity > 0.85, which means an individual has a > 85% chance of being heterozygous at such a locus. I'll break down the equation here and we will talk about it more in class.

In the equation above, $p_i$ is the frequency of the ith allele of n alleles at a locus. For example $p_1, p_2, p_3, \dots$ could correspond to p, q, r, ...
Remember the HW proportions equation $p^2 + 2pq + q^2 = 1$; then this follows:
Rearrange the above equation: $p^2 + q^2 + 2pq = 1$
Solve for heterozygotes: $2pq = 1 - (p^2 + q^2)$

When a locus has 4 or more alleles, which is true for many loci, it becomes much easier to sum the homozygous genotype frequencies than the heterozygous genotype combinations. For example, if you have 6 alleles, there are 21 possible genotypes:
$\frac{A(A+1)}{2} = \frac{6 \times 7}{2} = 21$, where A is the number of alleles
Of these 21 possible genotypes there are only 6 kinds of homozygous genotypes (A1A1, A2A2, A3A3, etc.) but 15 different heterozygous genotypes. As the number of alleles increases, it is easier to square each allele frequency to get the homozygote frequencies, sum them, and subtract from 1 to get the heterozygosity.
Heterozygosity = 1 – (sum of all the homozygous frequencies)

Measures of Genetic Diversity - Microsatellite Data

There are 4 standard measures of microsatellite diversity.

(1) P, the proportion of sampled loci that are polymorphic
P = x/m
x is the number of polymorphic loci in a sample of m loci

(2) HE – Expected heterozygosity (Nei 1987), a general measure of genetic diversity
Problem: diversity is high because of the high mutation rate

[Figure: average number of alleles captured (all loci combined) versus sample size (2N), for populations SM, BC, MB, and FB]

(3) A – Allele number, more sensitive to loss of genetic variation
# of alleles per locus in each population

(4) Rg – Allelic richness
Samples alleles at individual loci at the same sample size among populations, using a rarefaction method to estimate allelic richness. The subscript g is the number of genes sampled.
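As a rough illustration of measures (2) and (4), here is a minimal Python sketch. The function names are made up for this example. The rarefaction formula is the standard hypergeometric one (the expected number of distinct alleles in a random subsample of g gene copies); the notes do not say which rarefaction method was used, so treat that choice as an assumption. The allele counts in `counts` are hypothetical, not data from the lecture.

```python
from math import comb


def expected_heterozygosity(freqs):
    """H = 1 - (sum of squared allele frequencies) at one locus."""
    return 1.0 - sum(p * p for p in freqs)


def n_genotypes(n_alleles):
    """Number of possible diploid genotypes: A(A + 1) / 2."""
    return n_alleles * (n_alleles + 1) // 2


def allelic_richness(counts, g):
    """Expected number of distinct alleles in a subsample of g gene copies.

    `counts` holds the number of copies of each allele in the full sample.
    For each allele, 1 - C(N - N_i, g) / C(N, g) is the probability that
    at least one copy of it appears in the subsample.
    """
    n = sum(counts)
    return sum(1 - comb(n - c, g) / comb(n, g) for c in counts if c > 0)


# Two-allele example from the notes: p1 = 0.4, p2 = 0.6.
print(expected_heterozygosity([0.4, 0.6]))

# Six alleles: 21 genotypes, 6 homozygous and 15 heterozygous.
print(n_genotypes(6), n_genotypes(6) - 6)

# Hypothetical allele counts at one locus, rarefied to g = 10 gene copies.
counts = [12, 8, 5, 3, 1, 1]
print(allelic_richness(counts, 10))
```

Rarefying every population to the same g is what makes allele counts comparable across samples of very different sizes; with g equal to the full sample size, the function simply returns the observed allele number A.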
Locus m.2: allele counts by repeat number and location

Location                  11  12  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  Total  Unique all.  # of alleles
Big Creek Adults           0   0   0   0   0   8   2   1  12   2  19   5   6   8   6   5   0   2   0     76            0            12
Monterey Bay Adults        0   0   0   2   0   4   1   0   5   3   7   5   2   2   5   0   2   0   0     38            1            11
Fort Bragg Adults          1   0   1   0   0  14   3   0  14   3  24   6  14  14  11   7   2   0   0    114            0            13
San Miguel Is. Adults      1   0   4   0   1  18   7   1  15   3  15   9  20  11   7   9   1   1   1    124            2            17
Fort Ross Juveniles        0   1   2   0   1  19   5   0  61  25  31  10   8  25  18   4   3   2   0    215            1            16
Monterey Bay Juveniles     0   0  32   2   4  73  68   6 107  15  57  45 103  74  33  18   3   4   8    652            1            17
Carmel Bay Juveniles       0   0   4   0   0  11   8   2  14   3   6   4  12  10   3   2   0   0   1     80            0            13
Total                      2   1  43   4   6 147  94  10 228  54 159  84 165 144  83  45  11   9  10   1299            5            99

Allele Number (A) = # of alleles in the population
Big Creek Adults, Locus m.2: A = 12
Monterey Bay Juveniles, Locus m.2: A = 17
Big difference in sample size (totals of 76 vs. 652 in the table)!
Allelic richness (Rg) measures the # of alleles using a sample of N individuals, where N is the smallest population sample size, for all loci (N = 38)

Measures of Genetic Variation Using Sequence Data

1. Nucleotide Diversity - π
$\pi = \frac{n}{n-1} \sum_{i,j} x_i x_j \pi_{ij}$
$x_i$ = frequency of haplotype i (its count divided by the total number of sequences sampled)
$n/(n-1)$ = sampling error term, where n is the number of sequences (gene copies) sampled
$\pi_{ij}$ = proportion of nucleotides that differ between haplotype i and haplotype j

2. The number of segregating sites – θ (Theta)
Infinite-alleles model
$\theta = 4N_e\mu$
$S = n_p/n_t$, the number of polymorphic sites over the total number of sites
Here is how we estimate θ:
$S = a_1\theta$, where $a_1 = \sum_{i=1}^{n-1} 1/i$
Which we can rearrange to be
$\theta = S/a_1$
At steady state in the infinite-alleles model, π = θ

Estimating π and θ from DNA Sequence Data

An Example
- We collected a sample of 5 banana slugs from the woods outside of the UC Santa Cruz campus in California
- We sequenced a 500 bp region of the mitochondrial COI gene and observed 5 segregating sites in four distinct haplotypes

                    Nucleotide site in gene
Haplotype      N    4    45   345  398  456
Haplotype 1    2    T    G    T    C    T
Haplotype 2    1    T    A    T    T    A
Haplotype 3    1    C    G    T    C    T
Haplotype 4    1    C    G    G    C    T

1. Proportion of polymorphic sites (referred to as P or S)

2. Nucleotide diversity - π
$\pi = \frac{n}{n-1} \sum_{i,j} x_i x_j \pi_{ij}$
n = 5, the number of sequences sampled, therefore n/(n-1) = 5/4

Frequency
Hap1    0.4 (note that there are 2 Haplotype 1s)
Hap2    0.2
Hap3    0.2
Hap4    0.2

Pairwise Diff.
Hap1 & Hap2    0.006 (3 pairwise differences out of 500 possible)
Hap1 & Hap3    0.002
Hap1 & Hap4    0.004
Hap2 & Hap3    0.008
Hap2 & Hap4    0.01
Hap3 & Hap4    0.002

Make a matrix to sum

Hap (i)   Hap (j)   xi     xj     πij      xi xj πij
1         1         0.4    0.4    0        0
1         2         0.4    0.2    0.006    0.00048
1         3         0.4    0.2    0.002    0.00016
1         4         0.4    0.2    0.004    0.00032
2         1         0.2    0.4    0.006    0.00048
2         2         0.2    0.2    0        0
2         3         0.2    0.2    0.008    0.00032
2         4         0.2    0.2    0.01     0.0004
3         1         0.2    0.4    0.002    0.00016
3         2         0.2    0.2    0.008    0.00032
3         3         0.2    0.2    0        0
3         4         0.2    0.2    0.002    0.00008
4         1         0.2    0.4    0.004    0.00032
4         2         0.2    0.2    0.01     0.0004
4         3         0.2    0.2    0.002    0.00008
4         4         0.2    0.2    0        0
Σ                                          0.00352

π = 5/4 x (0.00352) = 0.0044

3. Segregating Sites - θ
$S = n_p/n_t$
$\theta = S/a_1$
S = # of segregating sites / total number of sites analyzed
S = 5/500 = 0.01
$a_1$ = 1/1 + 1/2 + ... + 1/(n-1) = 1/1 + 1/2 + 1/3 + 1/4 = 2.083
Note: n is the number of sequences sampled (5 in the example above, which also happens to have 5 segregating sites), and $a_1$ sums 1/i for i from 1 to n-1.
θ = S/$a_1$ = 0.010/2.083 = 0.0048

Notice that both estimates of nucleotide diversity are similar (π ≈ θ), which is consistent with steady state.
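The banana slug calculation can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the function names are made up, each haplotype is represented by its bases at the 5 segregating sites (4, 45, 345, 398, 456), and the remaining 495 sites are assumed identical across haplotypes, so they add no differences.

```python
from itertools import product

SEQ_LEN = 500  # total sites analyzed (bp)

# haplotype name -> (count in sample, bases at the 5 segregating sites)
haplotypes = {
    "Hap1": (2, "TGTCT"),
    "Hap2": (1, "TATTA"),
    "Hap3": (1, "CGTCT"),
    "Hap4": (1, "CGGCT"),
}


def nucleotide_diversity(haps, seq_len):
    """pi = n/(n-1) * sum over all (i, j) of x_i * x_j * pi_ij."""
    n = sum(count for count, _ in haps.values())
    total = 0.0
    for (ci, si), (cj, sj) in product(haps.values(), repeat=2):
        diffs = sum(a != b for a, b in zip(si, sj))  # sites that differ
        total += (ci / n) * (cj / n) * (diffs / seq_len)
    return n / (n - 1) * total


def theta_from_segregating_sites(haps, seq_len):
    """theta = S / a1, with S = n_p / n_t and a1 = sum of 1/i, i = 1..n-1."""
    n = sum(count for count, _ in haps.values())
    seqs = [seq for _, seq in haps.values()]
    # A site is segregating if more than one base is observed at it.
    n_seg = sum(len(set(column)) > 1 for column in zip(*seqs))
    a1 = sum(1 / i for i in range(1, n))
    return (n_seg / seq_len) / a1


pi = nucleotide_diversity(haplotypes, SEQ_LEN)
theta = theta_from_segregating_sites(haplotypes, SEQ_LEN)
print(f"pi = {pi:.4f}, theta = {theta:.4f}")
```

Looping over all ordered pairs (i, j), including i = j, mirrors the 16-row matrix in the notes; the diagonal terms contribute zero because a haplotype has no differences from itself.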