AEC 550 Conservation Genetics
Lecture #2 – Probability, Random Mating, HW Expectations, &
Genetic Diversity
Today:
Review Probability in Population Genetics
Review basic statistics
Population Definition
Random mating and non-overlapping generations models
Hardy-Weinberg Model
Look at measures of genetic diversity, following Tuesday’s talk
Note: some questions in these notes are left blank. Make sure you can
answer them after lecture; they often cover concepts that are
important for a deeper understanding and for your mid-term.
Probability Theory in Population Genetics
The PROBABILITY (P) of an event is the number of times the event will occur (a)
divided by the total number of possible events (n).
P = a/n
Multiplicative (Product) Rule : If the events A and B are independent, then the probability
that they both occur is
P(A and B) = P(A) x P(B)
That is, the probability of 2 or more independent events occurring simultaneously is equal to
the product of their individual probabilities.
For example, the probability of a progeny having the genotype AA at a locus is the
frequency of the A allele (denoted as p) in the population multiplied by the frequency of
the A allele in the population, or $p \times p = p^2$.
Sum Rule: The probability of 2 or more mutually exclusive events occurring is equal to the
sum of their individual probabilities:
P(A or B) = P(A) + P(B)
Using the example above, the frequency of the heterozygous genotype Aa at a locus is the
product of the two allele frequencies, pq (where q is the frequency of the a allele).
However, there are two mutually exclusive ways to get a heterozygote: an A from the mom
and an a from the dad, or an a from the mom and an A from the dad. We can write this as
pq + qp = 2pq.
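The two rules above can be sketched in a few lines of Python. This is only an illustration: the function name `genotype_frequencies` and the allele frequency p = 0.6 are made up for the example and are not from the lecture.

```python
def genotype_frequencies(p):
    """Expected genotype frequencies at a two-allele locus under random mating.

    p is the frequency of the A allele; q = 1 - p is the frequency of a.
    """
    q = 1.0 - p
    freq_AA = p * p          # product rule: A from mom AND A from dad
    freq_Aa = p * q + q * p  # sum rule: (A from mom, a from dad) OR (a from mom, A from dad)
    freq_aa = q * q          # product rule: a from mom AND a from dad
    return freq_AA, freq_Aa, freq_aa


# Hypothetical allele frequency, chosen only for illustration.
freq_AA, freq_Aa, freq_aa = genotype_frequencies(0.6)
print(freq_AA, freq_Aa, freq_aa)
```

The three frequencies should always sum to 1, which is a quick way to check the arithmetic.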
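Conditional probability – probability of one event given that the other
event has occurred.
$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$
If A and B are independent, $P(A \text{ and } B) = P(A) \times P(B)$, so
$P(A \mid B) = \frac{P(A) \times P(B)}{P(B)} = P(A)$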
BASIC STATISTICS:
Basic Terms:
Population = group of things we are interested in (population of
inference)
Sample = Subset of the population – typically it is not possible to
sample the total population
Random Sample = each member has an equal and independent
chance of being in that sample
Variable = an attribute common to all members of the population that
varies in its realization; these realizations are called variates
Random variable = a variable measured on the random sample
Continuous variables = metric variable, continuous scales, e.g.,
height
Discrete variable = meristic variable, countable, e.g., # of leaves, #
of digits, integers
Categorical variable = grouped and discrete but not ordered
Example:
Categories AA, Aa, aa
Discrete – number of A alleles
Parameter = numerical summary or constants that measure the
population of inference – describes the entire population
Example: $\sigma^2$ is the population variance and $\mu$ is the population
mean for a certain trait x1
Statistic = value of this numerical constant – calculated on the
sample and used to estimate the parameter.
Example: $s^2$ is the sample variance and $\bar{x}$ is the sample mean
Summary statistics allow us to compare populations and estimate
the parameters.
Statistics are divided into 5 categories:
Descriptive
Tests of difference
Tests of relationship
Multivariate exploratory methods
Estimators of population parameters
Central Tendency: Arithmetic Mean
Sample mean: $\bar{x} = \sum_{i=1}^{n} x_i / n$
Population mean: $\mu = \sum_{i=1}^{N} X_i / N$
Calculate the average fitness of a population:
From your sample of the population categorize individuals into
groups:
#     Genotype   Fitness
25    AA         0.7
50    Aa         0.5
25    aa         0.4
(freq. of category)(value of category)
(0.25)(0.7)+(0.5)(0.5)+(0.25)(0.4) = average fitness
The measure of variability or dispersion of points around the mean is
the variance.
$\sigma^2 = \sum (X - \mu)^2 / N$
$s^2 = \sum (x - \bar{x})^2 / (n-1)$
Standard deviation is the square root of $s^2$. Remember that for a
normal distribution about 68% of the area lies within 1 SD of the mean
and about 95% lies within 2 SD.
Do not confuse SE with SD:
SD measures the dispersion of the underlying raw data around their
mean, and SE measures the dispersion of a sample statistic (it is the
standard deviation of that statistic's sampling distribution).
For example: SE describes the sampling distribution of the sample mean
heterozygosity, while SD describes the distribution of the raw
heterozygosity values.
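Here is a small sketch of the difference between the variance, SD, and SE, using Python's standard `statistics` module. The heterozygosity values are hypothetical numbers invented for the example, and SE is computed with the usual formula SD divided by the square root of n, which the notes above do not spell out.

```python
import math
import statistics

# Hypothetical per-locus heterozygosity values (made up for illustration).
sample = [0.42, 0.51, 0.38, 0.47, 0.55, 0.44]

n = len(sample)
mean = statistics.mean(sample)

s2 = statistics.variance(sample)       # sample variance, divides by n - 1
sigma2 = statistics.pvariance(sample)  # population variance, divides by N

sd = math.sqrt(s2)       # spread of the raw values around the mean
se = sd / math.sqrt(n)   # spread of the sample mean itself

print(f"mean={mean:.3f}  s2={s2:.5f}  sigma2={sigma2:.5f}  SD={sd:.3f}  SE={se:.3f}")
```

Notice that SE shrinks as n grows, while SD does not: more data gives a more precise mean, but it does not make the raw values less variable.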
Geometric mean – the nth root of the product of n numbers, used in growth
rate estimates
Harmonic mean – the reciprocal of the mean of the reciprocals; it is weighted
toward the smallest values and is used in calculating the effective population size
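A short sketch of both means using Python's `statistics` module (`geometric_mean` needs Python 3.8 or later). The growth rates and population sizes are hypothetical values chosen to show how strongly one small value pulls the harmonic mean down.

```python
import statistics

# Hypothetical per-generation growth rates (made up for illustration).
growth_rates = [1.10, 0.80, 1.25]

# Hypothetical population sizes over three generations, with one bottleneck.
population_sizes = [1000, 10, 1000]

geo_mean = statistics.geometric_mean(growth_rates)        # cube root of the product
harm_mean = statistics.harmonic_mean(population_sizes)    # n / sum(1/x)
arith_mean = statistics.mean(population_sizes)

print(f"geometric mean growth rate: {geo_mean:.4f}")
print(f"harmonic mean size: {harm_mean:.1f}  vs arithmetic mean: {arith_mean}")
```

The harmonic mean here is about 29, far below the arithmetic mean of 670, because the single generation of 10 individuals dominates the reciprocals.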
POPULATIONS:
Group of organisms (species) living within a sufficiently restricted
geographic area with random mating
Local interbreeding population
Local population or demes (Mendelian populations or Subpopulations)
THE MODEL OF RANDOM MATING:
[Figure: Parent Population with genotype frequencies P(AA), P(Aa), P(aa) → Allele Pool of A and a alleles → New Population genotype frequencies P’(AA), P’(Aa), P’(aa)]
NON-OVERLAPPING GENERATIONS
Mostly insects and plants. While simple, the model works for a lot of
organisms with complex life-histories:
[Figure: generation t-1 → generation t → generation t+1]
HARDY-WEINBERG MODEL
GH Hardy & W Weinberg 1908 (independently)
WE Castle (1903 Harvard geneticist)
Assumptions of HW Principle
1. Diploid population (2N)
2. Sexual reproduction – no selfing
3. Non-overlapping generations
4. Locus with 2 alleles
5. Allele frequencies are equal in males and females
6. Random mating
7. Infinite population size
8. Mutation ignored
9. Natural Selection doesn’t affect alleles considered
Model with Theoretical Predictions
[Figure: Gen 1 and Gen 2 along a time axis]
p = frequency of A allele
q = frequency of a allele
p+q = 1
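Independent trials: $(pA + qa) \times (pA + qa) = 1$ (all genotypes)
So $p^2 + 2pq + q^2 = 1$
(1) Equilibrium allele frequencies: after one round of random mating, the allele
frequency $p$ and the genotype frequency $p^2$ are equal to their values in the
next generation, $p'$ and $(p^2)'$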
(2) What about random union of gametes?
EXAMPLE:
If we have a single locus with two alleles, A1 and A2
Let: p = frequency of A1 allele
q = frequency of A2 allele
What are the three possible genotypes?
The allele frequencies can be estimated from the genotype
frequencies:
Now if there is random mating what is the frequency of genotypes in
the next generation?
What are the progeny genotypes given the adult genotypes and
random mating?
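P, Q, and R are the adult frequencies of genotypes A1A1, A1A2, and A2A2.

Mating          Genotype     Frequency of zygotes (progeny)
                Frequency    A1A1    A1A2    A2A2
A1A1 x A1A1     $P^2$        1       0       0
A1A1 x A1A2     $2PQ$        ½       ½       0
A1A1 x A2A2     $2PR$        0       1       0
A1A2 x A1A2     $Q^2$        ¼       ½       ¼
A1A2 x A2A2     $2QR$        0       ½       ½
A2A2 x A2A2     $R^2$        0       0       1
New genotypes                P’      Q’      R’

P’+Q’+R’=1
$P' = P^2 + \frac{2PQ}{2} + \frac{Q^2}{4} = \cdots = p^2$
$Q' = \frac{2PQ}{2} + 2PR + \frac{Q^2}{2} + \frac{2QR}{2} = \cdots = 2pq$
$R' = R^2 + \frac{2QR}{2} + \frac{Q^2}{4} = \cdots = q^2$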
For extra credit on your homework this week, can you prove the
connection of the equation for P’ to $p^2$, Q’ to $2pq$, and R’ to $q^2$?
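A numerical check is not a proof, but it is a handy way to convince yourself the mating table works. The sketch below builds P’, Q’, and R’ from the six mating types and compares them with $p^2$, $2pq$, and $q^2$. The starting genotype frequencies are hypothetical (deliberately not in HW proportions), and the allele frequencies are taken as p = P + Q/2 and q = R + Q/2.

```python
def next_generation(P, Q, R):
    """Genotype frequencies after one round of random mating.

    P, Q, R are the adult frequencies of A1A1, A1A2, A2A2.
    """
    # Frequency of each mating type, in the order used in the table.
    mating_freq = [P * P, 2 * P * Q, 2 * P * R, Q * Q, 2 * Q * R, R * R]
    # Proportion of A1A1, A1A2, A2A2 zygotes produced by each mating type.
    zygotes = [
        (1.0, 0.0, 0.0),
        (0.5, 0.5, 0.0),
        (0.0, 1.0, 0.0),
        (0.25, 0.5, 0.25),
        (0.0, 0.5, 0.5),
        (0.0, 0.0, 1.0),
    ]
    P_next = sum(f * z[0] for f, z in zip(mating_freq, zygotes))
    Q_next = sum(f * z[1] for f, z in zip(mating_freq, zygotes))
    R_next = sum(f * z[2] for f, z in zip(mating_freq, zygotes))
    return P_next, Q_next, R_next


# Hypothetical starting genotype frequencies, not in HW proportions.
P, Q, R = 0.5, 0.2, 0.3
p = P + Q / 2  # frequency of A1
q = R + Q / 2  # frequency of A2

P_next, Q_next, R_next = next_generation(P, Q, R)
print(P_next, Q_next, R_next)
print(p * p, 2 * p * q, q * q)
```

Try running `next_generation` a second time on its own output: the frequencies should not change again.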
EXAMPLE
Measures of Genetic Diversity - Allozyme Data
There are two standard measures of allozyme diversity
(1) P, the proportion of loci sampled that are polymorphic
P = x/m
x is the number of polymorphic loci in a sample of m loci
Note: Often you’ll see this measure used as a measure of diversity for allozyme loci, but
because of sampling (with low sample numbers, loci may appear monomorphic that
prove polymorphic once more individuals are in the sample, see below), this is not a good
measure for highly polymorphic loci.
(2) H, mean Heterozygosity
Sample a locus with two alleles at frequencies of 0.4 and 0.6
Let $p_1 = 0.4$ and $p_2 = 0.6$
Homozygotes: $p_1^2 = 0.16$; $p_2^2 = 0.36$
Therefore $1 - (0.16 + 0.36) = 0.48$ (48% heterozygotes)
Average over all loci including monomorphic ones!
General equation (Nei 1987): $H = 1 - \sum_{i=1}^{n} p_i^2$
Unbiased estimate
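The heterozygosity calculation can be sketched as follows. The first function is just 1 minus the sum of the squared allele frequencies, as in the worked example. The second applies the small-sample correction 2N/(2N-1) that is commonly attributed to Nei; the lecture's own unbiased equation is not reproduced in these notes, so treat that form, the sample size of 20 individuals, and all function names as assumptions for the example.

```python
def expected_heterozygosity(allele_freqs):
    """H = 1 - sum of squared allele frequencies at one locus."""
    return 1.0 - sum(p * p for p in allele_freqs)


def unbiased_heterozygosity(allele_freqs, n_individuals):
    """Small-sample correction, assuming the common 2N / (2N - 1) form."""
    two_n = 2 * n_individuals
    return (two_n / (two_n - 1)) * expected_heterozygosity(allele_freqs)


def mean_heterozygosity(loci):
    """Average H over all loci, including monomorphic ones."""
    return sum(expected_heterozygosity(freqs) for freqs in loci) / len(loci)


h = expected_heterozygosity([0.4, 0.6])
print(round(h, 2))  # 0.48, matching the worked example

# Hypothetical sample of 20 diploid individuals.
print(unbiased_heterozygosity([0.4, 0.6], n_individuals=20))

# One polymorphic locus plus one monomorphic locus (single allele at frequency 1).
print(mean_heterozygosity([[0.4, 0.6], [1.0]]))
```

Including the monomorphic locus halves the mean, which is why leaving such loci out would overstate diversity.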
Note: The general equation for expected heterozygosity is often
referred to as a measure of diversity. We use this equation for more
than just allozymes, and it’s fundamental to understand for measuring
divergence among populations (F-statistics). I like to think of the
measure as the probability of an individual being heterozygous at a
given locus. Many human microsatellite loci have heterozygosities >0.85,
which means you have a >85% chance of being heterozygous at such a locus.
I’ll break down the equation here and we will talk about it more in class.
In the equation above, $p_i$ is the frequency of the ith allele of n alleles at a locus.
For example $p_1, p_2, p_3, \ldots$ could correspond to $p, q, r, \ldots$
Remember the HW proportions equation $p^2 + 2pq + q^2 = 1$; then this
follows:
Rearrange the above equation: $p^2 + q^2 + 2pq = 1$
Solve for heterozygotes: $2pq = 1 - (p^2 + q^2)$
If you think about a situation, which could be true for many loci, where
there are 4 or more alleles, it becomes much easier to take the sum of the
homozygous rather than the heterozygous genotype combinations.
For example, if you have 6 alleles (A = 6), there are 21 possible genotypes:
$\frac{A(A+1)}{2} = \frac{6 \times 7}{2} = 21$
Of these 21 possible genotypes there are only 6 kinds of homozygous genotypes
(A1A1, A2A2, A3A3, etc.) but there are 15 different heterozygous
genotypes. As the number of alleles increases, it is easier to square the
allele frequencies to get the homozygote frequencies and subtract their sum
from 1 to calculate the heterozygosity.
Heterozygosity = 1 – (sum of all the homozygous frequencies)
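To see why the shortcut saves work, the sketch below counts the genotypes for 6 alleles and then computes heterozygosity two ways: by adding up all 15 heterozygote terms, and by 1 minus the sum of the 6 homozygote terms. The six allele frequencies are hypothetical values that sum to 1.

```python
from itertools import combinations


def genotype_counts(n_alleles):
    """Return (total, homozygous, heterozygous) genotype counts for A alleles."""
    total = n_alleles * (n_alleles + 1) // 2
    homozygous = n_alleles
    heterozygous = n_alleles * (n_alleles - 1) // 2
    return total, homozygous, heterozygous


print(genotype_counts(6))  # (21, 6, 15)

# Hypothetical allele frequencies for a 6-allele locus (they sum to 1).
freqs = [0.30, 0.25, 0.20, 0.10, 0.10, 0.05]

# Long way: add up 2 * p_i * p_j for every pair of different alleles.
het_long = sum(2 * pi * pj for pi, pj in combinations(freqs, 2))

# Shortcut: 1 minus the sum of the homozygote frequencies.
het_short = 1 - sum(p * p for p in freqs)

print(het_long, het_short)
```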
Measures of Genetic Diversity
Microsatellite Data
There are 4 standard measures of microsatellite diversity
(1) P, the proportion of loci sampled that are polymorphic
P = x/m
x is the number of polymorphic loci in a sample of m loci
(2) $H_E$, expected heterozygosity (Nei 1987): general measure of genetic diversity
Problem: high diversity because of high mutation rate
[Figure: average number of alleles captured (all loci combined) versus Sample Size (2N); series SM, BC, MB, FB]
(3) A, allele number: more sensitive to loss of genetic variation
# of alleles per locus in each population
(4) $R_g$, allelic richness
Compares alleles at individual loci at the same sample size among populations,
using a rarefaction method to estimate allelic richness. The subscript g is the number
of genes sampled.
Locus m.2 (allele counts by repeat number)
Location                 11  12  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  Total  Unique All.  # of Allele
Big Creek Adults          0   0   0   0   0   8   2   1  12   2  19   5   6   8   6   5   0   2   0     76            0           12
Monterey Bay Adults       0   0   0   2   0   4   1   0   5   3   7   5   2   2   5   0   2   0   0     38            1           11
Fort Bragg Adults         1   0   1   0   0  14   3   0  14   3  24   6  14  14  11   7   2   0   0    114            0           13
San Miguel Is. Adults     1   0   4   0   1  18   7   1  15   3  15   9  20  11   7   9   1   1   1    124            2           17
Fort Ross Juveniles       0   1   2   0   1  19   5   0  61  25  31  10   8  25  18   4   3   2   0    215            1           16
Monterey Bay Juveniles    0   0  32   2   4  73  68   6 107  15  57  45 103  74  33  18   3   4   8    652            1           17
Carmel Bay Juveniles      0   0   4   0   0  11   8   2  14   3   6   4  12  10   3   2   0   0   1     80            0           13
Total                     2   1  43   4   6 147  94  10 228  54 159  84 165 144  83  45  11   9  10   1299            5           99
Allele Number (A) = # of alleles in the population sample
Big Creek Adults, Locus m.2 = 12
Monterey Bay Juveniles, Locus m.2 = 17
Big difference in sample size (76 vs. 652)!
Allelic richness ($R_g$) measures the # of alleles using a sample the size of the smallest
population sample for all loci (N=38)
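The lecture does not give the rarefaction formula, so the sketch below uses a commonly used hypergeometric form as an assumption: the expected number of distinct alleles in a random subsample of g gene copies is the sum, over alleles, of 1 - C(N - N_i, g) / C(N, g), where N_i is the count of allele i and N is the total count. The allele counts are the nonzero entries for Big Creek Adults and Monterey Bay Juveniles from the Locus m.2 table, and g = 38 follows the N = 38 mentioned above.

```python
from math import comb


def allelic_richness(allele_counts, g):
    """Expected number of distinct alleles in a subsample of g gene copies.

    Assumes the hypergeometric rarefaction formula described in the lead-in.
    """
    n_total = sum(allele_counts)
    if g > n_total:
        raise ValueError("g cannot exceed the number of gene copies sampled")
    denominator = comb(n_total, g)
    return sum(
        1 - comb(n_total - n_i, g) / denominator
        for n_i in allele_counts
        if n_i > 0
    )


# Nonzero allele counts at Locus m.2, read from the table.
big_creek_adults = [8, 2, 1, 12, 2, 19, 5, 6, 8, 6, 5, 2]
monterey_bay_juveniles = [32, 2, 4, 73, 68, 6, 107, 15, 57, 45, 103, 74, 33, 18, 3, 4, 8]

g = 38  # size of the smallest sample, as stated in the lecture
print(allelic_richness(big_creek_adults, g))
print(allelic_richness(monterey_bay_juveniles, g))
```

Rarefying both samples to the same g puts them on an equal footing, so the raw difference of 12 versus 17 alleles is no longer confounded with how many gene copies were sampled.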
Measures of Genetic Variation Using Sequence Data
1. Nucleotide Diversity - π
$\pi = \frac{n}{n-1} \sum x_i x_j \pi_{ij}$
$x_i$ = the frequency of the ith haplotype (its count divided by the total
number of sequences sampled)
$n/(n-1)$ = sampling error term, where n is the # of sequences sampled
$\pi_{ij}$ = proportion of nucleotides that differ between haplotype i and haplotype j
2. The number of segregating sites – θ (Theta)
Infinite-sites model
$\theta = 4N_e\mu$
$S = n_p/n_t$, the number of polymorphic sites over the total number of
sites
Here is how we estimate θ: $S = a_1\theta$, where $a_1 = \sum_{i=1}^{n-1} 1/i$
Which we can rearrange to be $\theta = S/a_1$
At steady state in the infinite-sites model, π = θ
Estimating π and θ from DNA Sequence Data
An Example
- We collected a sample of 5 banana slugs from the woods outside of
the UC Santa Cruz campus in California
- We sequenced a 500 bp region of the mitochondrial COI gene and
observed 5 segregating sites in four distinct haplotypes
                    Nucleotide site in gene
              N     4     45    345   398   456
Haplotype 1   2     T     G     T     T     T
Haplotype 2   1     T     A     C     C     A
Haplotype 3   1     C     G     T     G     T
Haplotype 4   1     C     G     T     C     T
1. Proportion of polymorphic sites - (referred to as P or S)
2. Nucleotide diversity - π
$\pi = \frac{n}{n-1} \sum x_i x_j \pi_{ij}$
n = 5, the number of sequences sampled, therefore $n/(n-1) = 5/4$
Frequency
Hap1   0.4 (note that there are 2 Haplotype 1s)
Hap2   0.2
Hap3   0.2
Hap4   0.2
Pairwise Diff.
Hap1&Hap2   0.006 (3 pairwise differences out of 500 possible)
Hap1&Hap3   0.002
Hap1&Hap4   0.004
Hap2&Hap3   0.008
Hap2&Hap4   0.01
Hap3&Hap4   0.002
Make a matrix to sum
Hap (i)   Hap (j)   xi    xj    πij     xixjπij
1         1         0.4   0.4   0       0
1         2         0.4   0.2   0.006   0.00048
1         3         0.4   0.2   0.002   0.00016
1         4         0.4   0.2   0.004   0.00032
2         1         0.2   0.4   0.006   0.00048
2         2         0.2   0.2   0       0
2         3         0.2   0.2   0.008   0.00032
2         4         0.2   0.2   0.01    0.0004
3         1         0.2   0.4   0.002   0.00016
3         2         0.2   0.2   0.008   0.00032
3         3         0.2   0.2   0       0
3         4         0.2   0.2   0.002   0.00008
4         1         0.2   0.4   0.004   0.00032
4         2         0.2   0.2   0.01    0.0004
4         3         0.2   0.2   0.002   0.00008
4         4         0.2   0.2   0       0
Σ                                       0.00352
$\pi = \frac{n}{n-1} \sum x_i x_j \pi_{ij}$
$\pi = 5/4 \times 0.00352 = 0.0044$
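The matrix sum can be reproduced with a short script. It uses the haplotype frequencies and pairwise difference values exactly as listed in the example; the function and variable names are only for illustration.

```python
from itertools import product

# Haplotype frequencies from the example (Hap1 occurs twice among 5 sequences).
freqs = {1: 0.4, 2: 0.2, 3: 0.2, 4: 0.2}

# Proportion of differing nucleotides for each pair, as listed in the example.
pairwise = {
    (1, 2): 0.006,
    (1, 3): 0.002,
    (1, 4): 0.004,
    (2, 3): 0.008,
    (2, 4): 0.010,
    (3, 4): 0.002,
}


def nucleotide_diversity(freqs, pairwise, n):
    """pi = n/(n-1) * sum over all ordered pairs (i, j) of x_i * x_j * pi_ij."""
    total = 0.0
    for i, j in product(freqs, repeat=2):
        if i == j:
            continue  # a haplotype does not differ from itself
        pi_ij = pairwise[(min(i, j), max(i, j))]
        total += freqs[i] * freqs[j] * pi_ij
    return (n / (n - 1)) * total


pi_hat = nucleotide_diversity(freqs, pairwise, n=5)
print(round(pi_hat, 4))  # 0.0044
```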
3. Segregating Sites θ
$S = n_p/n_t$
$\theta = S/a_1$
S = # segregating sites / total number of sites analyzed
S = 5/500 = 0.01
$a_1 = 1/1 + 1/2 + \ldots + 1/(n-1) = 1/1 + 1/2 + 1/3 + 1/4 = 2.083$
Note: n = # of alleles (sequences) sampled. In the example above you have 5
sequences, so you sum the reciprocals of 1 through n-1 = 4 to calculate
$a_1$.
$\theta = S/a_1 = 0.010/2.083 = 0.0048$
Notice that both estimates of nucleotide diversity are similar (π = 0.0044,
θ = 0.0048); π ≈ θ is consistent with steady state
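The same θ calculation as a short sketch, using the numbers from the example (5 segregating sites, 500 sites, 5 sequences). The function name is made up for illustration; this estimator is commonly known as Watterson's estimator.

```python
def theta_from_segregating_sites(n_segregating, n_sites, n_sequences):
    """theta = S / a1, with S per site and a1 = 1/1 + 1/2 + ... + 1/(n-1)."""
    s = n_segregating / n_sites
    a1 = sum(1.0 / i for i in range(1, n_sequences))
    return s / a1


theta_hat = theta_from_segregating_sites(n_segregating=5, n_sites=500, n_sequences=5)
print(round(theta_hat, 4))  # 0.0048
```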