Populations
Large populations
Terns
Small populations
Dryopteris fragrans, a rare cliff fern
Dynamic populations
Homo sapiens
Complex populations
Markers: isozymes
AFLPs
Illumina Beadstation genotyping for SNPs
High-throughput genotyping
Genotyping of a cross:
• Low cost per genotype (5-20 cents), but the need to assay a
large number of genotypes (either 384, 768, or 1586)
makes the total cost large (thousands of $)
What do populations have to do with
genetic markers?

Population processes influence levels of marker diversity

Conversely, polymorphic genetic
markers can be used to infer many
population processes

Emphasis in FRST432 is the latter
Quantifying genetic variation

Gene frequency

Genetic diversity

Hardy-Weinberg
Estimation of gene frequency

For co-dominant loci, simply count the
numbers (“gene counting method”)

Gene counting method also is the
“maximum likelihood estimate”
Estimation of allele frequency

MN blood group:

genotype    MM     MN     NN
number      392    707    320

Total: 1419 individuals (the sample of genes is twice this, 2838)

Frequency of M = p_M = (2 x 392 + 707) / (2 x 1419) = 0.525
Frequency of N = 1 - p_M = 0.475
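
The gene counting arithmetic above can be sketched in a few lines of Python. The function name `allele_frequency` is a hypothetical helper chosen for illustration; the counts are the MN blood group numbers from the slide.

```python
def allele_frequency(n_hom1, n_het, n_hom2):
    """Gene counting: frequency of allele 1 from co-dominant genotype counts."""
    n_individuals = n_hom1 + n_het + n_hom2
    # Each homozygote carries two copies of the allele, each heterozygote one.
    return (2 * n_hom1 + n_het) / (2 * n_individuals)

# MM, MN, NN counts from the MN blood group example
p_m = allele_frequency(392, 707, 320)
p_n = 1 - p_m
print(round(p_m, 3), round(p_n, 3))
```

The denominator is twice the number of individuals because each diploid individual contributes two gene copies.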

Estimation of gene frequency

Estimation based upon gene counting
$p_A = (2N_{AA} + N_{Aa}) / (2N_{AA} + 2N_{Aa} + 2N_{aa})$

More theoretical relationship
$p_A = f_{AA} + 0.5 f_{Aa}$
where the f's are genotypic frequencies

Binomial sampling variance
$Var(p_A) = Var(p_a) = p_A (1 - p_A) / (2N)$

Construct confidence interval
Comparing among populations
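
As a rough sketch of how the binomial sampling variance leads to a confidence interval, the following uses a normal approximation with z = 1.96 for an approximate 95% interval. The normal approximation and the helper name `allele_freq_ci` are assumptions made for illustration; they are not stated on the slide.

```python
import math

def allele_freq_ci(p, n_individuals, z=1.96):
    """Variance, standard error and approximate CI for an allele frequency."""
    var = p * (1 - p) / (2 * n_individuals)   # binomial variance over 2N gene copies
    se = math.sqrt(var)
    return var, se, (p - z * se, p + z * se)

# MN blood group example: p_M = 0.525 from N = 1419 individuals
var, se, (lower, upper) = allele_freq_ci(0.525, 1419)
print(round(se, 4), round(lower, 3), round(upper, 3))
```

Two populations whose intervals do not overlap are unlikely to share the same allele frequency, which is one simple way of comparing among populations.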
Hardy-Weinberg

Predict genotypic frequencies from gene frequencies



$F(AA) = p^2$
$F(Aa) = 2pq$
$F(aa) = q^2$

Expansion of $(p + q)^2$

HW is basis for almost all models

Inbreeding also detected as excess of homozygotes
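
A minimal sketch of using these predictions: compute the genotype counts expected under Hardy-Weinberg from the observed allele frequency and compare them with the observed counts. The data are the MN blood group counts from the earlier slide; the chi-square comparison and the helper name `hw_expected_counts` are additions for illustration.

```python
def hw_expected_counts(n_aa, n_ab, n_bb):
    """Expected genotype counts under Hardy-Weinberg, given observed counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # gene counting estimate
    q = 1 - p
    return [n * p * p, n * 2 * p * q, n * q * q]

observed = [392, 707, 320]            # MM, MN, NN
expected = hw_expected_counts(*observed)
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print([round(e, 1) for e in expected], round(chi_square, 4))
```

With three genotype classes and one estimated allele frequency the test has one degree of freedom, so a chi-square value well below 3.84 gives no evidence of departure from Hardy-Weinberg proportions at the 5% level.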
Historical context

Is the population going to be driven to a
particular frequency for an allele simply
because it is inherited in a Mendelian fashion?

Is the recessive phenotype driven to occur in
25% of the population?

Hardy and Weinberg proved this was false
HW “Equilibrium”

Equilibrium = nothing changes across
generations

Genotypes are transient, broken up each
generation

Reconstituted randomly into zygotes

Reached in just one generation
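
The claim that equilibrium is reached in one generation can be checked with a small sketch. The starting genotype frequencies are hypothetical (a population with no heterozygotes at all), and `random_mating` is an illustrative helper that assumes the Hardy-Weinberg conditions hold.

```python
def random_mating(f_aa, f_ab, f_bb):
    """Genotype frequencies after one generation of random mating."""
    p = f_aa + 0.5 * f_ab    # allele frequency in the gamete pool
    q = 1 - p
    return p * p, 2 * p * q, q * q

gen0 = (0.5, 0.0, 0.5)       # hypothetical start: far from HW proportions
gen1 = random_mating(*gen0)
gen2 = random_mating(*gen1)
print(gen0, gen1, gen2)
```

The genotypes are broken up into gametes and reconstituted at random, so `gen1` is already in Hardy-Weinberg proportions and `gen2` is identical to it, while the allele frequency stays at 0.5 throughout.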
Assumptions of HW

No directive forces


No mutation, migration, selection
No dispersive forces

Infinite population size, random mating
Predictions of HW

Allele frequencies unchanged over time

After one generation, genotypic
frequencies unchanged over time

Allele frequencies, not genotypic
frequencies, are sufficient parameters for
models
One prediction of H-W rule
The fundamental measure of genetic variation:
expected heterozygosity

At one locus, the gene frequency for the i-th allele is $p_i$

The expected Hardy-Weinberg frequency of the homozygous
genotype is $p_i^2$

Over all possible alleles i, i = 1, ..., n, the probability that the
locus is homozygous for any allele is
$J = \sum_{i=1}^{n} p_i^2$

Expected heterozygosity = 1 - expected homozygosity
$H = 1 - J = 1 - \sum_{i=1}^{n} p_i^2$
often referred to as gene diversity
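
As an illustration, expected heterozygosity can be computed directly from a list of allele frequencies. The helper name is hypothetical; the first example uses the MN blood group frequencies from the earlier slide, and the second uses hypothetical equal frequencies.

```python
def expected_heterozygosity(freqs):
    """H = 1 - J, where J is the sum of squared allele frequencies."""
    j = sum(p * p for p in freqs)   # expected homozygosity
    return 1 - j

h_mn = expected_heterozygosity([0.525, 0.475])     # two alleles, MN example
h_three = expected_heterozygosity([1 / 3] * 3)     # three equally frequent alleles
print(round(h_mn, 5), round(h_three, 4))
```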
Heterozygosity at 20 variable allozymes out of 71
loci sampled in a population of European people

Gene Locus   Enzyme Encoded                              Heterozygosity H
Aph          Alkaline phosphatase (placental)            0.53
Acph         Acid phosphatase                            0.52
Gpt          Glutamate-pyruvate transaminase             0.50
Adh-3        Alcohol dehydrogenase-3                     0.48
Peps         Pepsinogen                                  0.47
Pgm-2        Phosphoglucomutase-2                        0.38
Pept-A       Peptidase-A                                 0.37
Pgm-1        Phosphoglucomutase-1                        0.36
Me           Malic enzyme                                0.30
Ace          Acetylcholinesterase                        0.23
Adn          Adenosine deaminase                         0.11
Gput         Galactose-1-phosphate uridyl transferase    0.11
Adk          Adenylate kinase                            0.09
Amy          Amylase (pancreatic)                        0.09
Adh-2        Alcohol dehydrogenase-2                     0.07
6Pgdh        6-Phosphogluconate dehydrogenase            0.05
Hk           Hexokinase (white-cell)                     0.05
Got          Glutamate-oxaloacetate transaminase         0.03
Pept-C       Peptidase-C                                 0.02
Pept-D       Peptidase-D                                 0.02
51 loci invariant (monomorphic)                          0.00

After H. Harris and D. A. Hopkinson, J. Human Genetics
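
A short sketch of how the table can be summarized: average the heterozygosity over all 71 loci, counting the 51 invariant loci as H = 0. The H values are copied from the table; the variable names are illustrative.

```python
# Heterozygosity at the 20 variable loci, in table order
h_variable = [0.53, 0.52, 0.50, 0.48, 0.47, 0.38, 0.37, 0.36, 0.30, 0.23,
              0.11, 0.11, 0.09, 0.09, 0.07, 0.05, 0.05, 0.03, 0.02, 0.02]
n_loci = 71                                  # 20 variable + 51 invariant loci

mean_h = sum(h_variable) / n_loci            # invariant loci contribute H = 0
fraction_variable = len(h_variable) / n_loci
print(round(mean_h, 3), round(fraction_variable, 2))
```

Averaging over all sampled loci, not just the variable ones, is what makes the mean comparable across species.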
Comparison of isozyme variation across kingdoms
Variation of diversity among species

Explaining levels of diversity is a prime activity of
population genetics
• Plants have the most diverse array of life histories: short-lived
species and self-fertilizers have the least variation, long-lived
outcrossers have the most variation
• Vertebrates have the narrowest array of life histories,
hence the lowest variation of diversity among species
• Just explaining the mean level of diversity is
challenging

Outcome of complex interplay of mutation, selection,
and chance (drift)…
Q. What does heterozygosity measure?
A. The tendency for a population to have “intermediate” gene frequencies
Other measures of genetic variation
Polymorphism

Ford (1940)

“the occurrence together in the same habitat of two or more
discontinuous forms in such proportions that the rarest of
them cannot be maintained by recurrent mutation”
probably not a good definition in 2006
Polymorphism



Cavalli-Sforza and Bodmer (1971)
“the occurrence in the same population of two or more alleles
at one locus, each with appreciable frequency”
but what is “appreciable frequency?”
Other measures of diversity
Proportion of polymorphic loci: P


practical definition of “appreciable frequency”
arbitrary limit for most common allele
0.95 normally
0.99 sometimes (used when sample is adequate, N >100)
Numbers of alleles

Number of alleles, n



allele diversity or allele richness
strongly influenced by sample size
Effective number of alleles


ne = 1 / ( 1 - H )
number of equally frequent alleles that gives observed H
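
A minimal sketch of the effective number of alleles, using hypothetical allele frequencies. It shows that n_e equals the actual number of alleles only when they are equally frequent.

```python
def effective_number_of_alleles(freqs):
    """n_e = 1 / (1 - H), with H the expected heterozygosity."""
    h = 1 - sum(p * p for p in freqs)
    return 1 / (1 - h)

ne_even = effective_number_of_alleles([0.25, 0.25, 0.25, 0.25])   # four equal alleles
ne_skewed = effective_number_of_alleles([0.85, 0.05, 0.05, 0.05]) # four unequal alleles
print(round(ne_even, 2), round(ne_skewed, 2))
```

Both hypothetical loci have four alleles, but the skewed locus behaves like one with far fewer, because rare alleles add little to H.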
P vs. H over taxa
Measures of nucleotide diversity

Proportion of sites that differ = S/N



S=number of segregating sites
N=number of nucleotide sites
Depends on number of sequences aligned



the more sequences, the higher S
like the proportion of polymorphic loci
Nucleotide diversity



Heterozygosity averaged over aligned sites
If there are K sequences, make all possible pairwise
comparisons (there are K(K-1)/2 comparisons)
Analogous to H as estimated from gene frequencies
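
The two nucleotide measures can be sketched as follows for a small alignment. The sequences are hypothetical, the helper names are illustrative, and the code assumes equal-length aligned sequences without gaps.

```python
from itertools import combinations

def segregating_proportion(seqs):
    """S/N: proportion of aligned sites at which the sequences differ."""
    n_sites = len(seqs[0])
    s = sum(1 for site in zip(*seqs) if len(set(site)) > 1)
    return s / n_sites

def nucleotide_diversity(seqs):
    """Mean proportion of differing sites over all K(K-1)/2 pairs."""
    n_sites = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return diffs / (len(pairs) * n_sites)

# Hypothetical alignment of K = 4 sequences, N = 10 sites
seqs = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC", "ACGTACGTAC"]
print(segregating_proportion(seqs), nucleotide_diversity(seqs))
```

Adding more sequences can only keep S the same or raise it, whereas the pairwise average does not grow systematically with sample size.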
Estimation of gene frequency

Gene counting

Freq(A) = Freq(AA)+.5 Freq(Aa)

Var(p) = p(1-p)/(2N)



Binomial sampling variance
Construct confidence interval
Dominance: need Hardy-Weinberg
Estimation of gene frequency

Dominance: assume Hardy-Weinberg
$f_{aa} = q^2$
$\hat{q} = \sqrt{f_{aa}}$
$Var(\hat{q}) = (1 - q^2) / (4N)$
Kermode bear example

A total of 87 bears were collected for hair samples on Gribbell,
Princess Royal and Roderick Islands

66 were black, 21 were white

Frequency of recessive phenotype = 21/(66+21) = 0.241

Estimate of gene frequency of white gene is square root of this:
sqrt(0.241) = 0.49

Variance is (1 - 0.49^2)/(4 x 87) = 0.00218

SE is sqrt of this, sqrt(0.00218) =0.046
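
The Kermode calculation above can be reproduced with a short sketch. It assumes Hardy-Weinberg proportions, as the slide does, and the function name is a hypothetical helper.

```python
import math

def recessive_allele_estimate(n_recessive, n_total):
    """Estimate q from the recessive phenotype frequency, assuming HW."""
    f_aa = n_recessive / n_total
    q = math.sqrt(f_aa)                      # q-hat = sqrt(f_aa)
    var = (1 - q * q) / (4 * n_total)        # Var(q-hat) = (1 - q^2) / (4N)
    return q, var, math.sqrt(var)

# 21 white bears out of 87 sampled
q_white, var_q, se_q = recessive_allele_estimate(21, 87)
print(round(q_white, 2), round(var_q, 5))
```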
We also have nucleotide data for gene
underlying Kermode coat color

AA and AG = black, GG = white
42 AA, 24 AG, 21 GG
Gene frequency of G (white) =
(24 + 2 x 21) / (2 x 87) = 0.38
SE = sqrt(q(1-q)/2N) = 0.037
Using just coat color, with white recessive


q=0.49, SE=0.046 (from previous slide)
q is higher (0.49 vs. 0.38); why?
Expected frequency of white bears

Using co-dominant Mc1r data, expected number of GG = 87 x (0.38)^2 =
12.5

Observed number is 21 (>>12.5)

Can be caused by
• Assortative mating, which creates an excess of the white genotype (GG)
over HW expectations
• Variation of gene frequency among islands

Microsatellite loci show no excess homozygosity!
• Assortative mating at coat color locus
• Excess homozygosity only at Mc1r
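
A sketch of the comparison above, using the Mc1r genotype counts from the slide. The fixation index computed here, F = 1 - H_obs / H_exp, is the standard way of expressing excess homozygosity relative to Hardy-Weinberg; treating it as the statistic the author has in mind is an assumption, and the function name is illustrative.

```python
def hw_excess(n_aa, n_ag, n_gg):
    """Allele frequency of G, expected GG count under HW, and fixation index F."""
    n = n_aa + n_ag + n_gg
    q = (n_ag + 2 * n_gg) / (2 * n)   # gene counting for G
    expected_gg = n * q * q           # HW expectation for white bears
    h_obs = n_ag / n                  # observed heterozygosity
    h_exp = 2 * q * (1 - q)           # expected heterozygosity
    return q, expected_gg, 1 - h_obs / h_exp

q_g, exp_gg, f_index = hw_excess(42, 24, 21)
print(round(q_g, 2), round(exp_gg, 1), round(f_index, 2))
```

A positive F means fewer heterozygotes, and therefore more homozygotes, than random mating predicts.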
Null alleles or inbreeding? Fis values (excess homozygosity above
HW expectations) for Yellow Warbler microsatellites

Locus     Fis value
Caµ 28    0.30
Dpµ 01    0.01
Dpµ 03    0.05
Dpµ 15    0.12
Dpµ 16    0.00
Maµ 23    0.02
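Another exercise in HW: null alleles increase
apparent homozygote frequency

Let allele n be the null allele. The apparent homozygosity is the
sum of all true homozygotes plus all heterozygous nulls (the latter
is the sum of the last row and last column of the expansion of
gene frequencies, except for the lower right corner):

$\sum_{i=1}^{n} p_i^2 + \sum_{i=1}^{n-1} 2 p_i p_n$

$\begin{pmatrix}
p_1^2 & p_1 p_2 & \cdots & p_1 p_n \\
p_1 p_2 & p_2^2 & \cdots & p_2 p_n \\
\vdots & \vdots & \ddots & \vdots \\
p_1 p_n & p_2 p_n & \cdots & p_n^2
\end{pmatrix}$

This equals the expected homozygosity plus approximately twice the
null frequency (exactly, $2 p_n (1 - p_n)$):

$J_e + 2 p_n (1 - p_n)$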
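
The null allele identity can be checked numerically with a small sketch. The allele frequencies are hypothetical, with the null allele placed last, and `apparent_homozygosity` is an illustrative helper.

```python
def apparent_homozygosity(freqs):
    """Apparent homozygosity when the last allele in freqs is a null allele."""
    p_null = freqs[-1]
    true_homozygotes = sum(p * p for p in freqs)                # diagonal of the expansion
    null_heterozygotes = sum(2 * p * p_null for p in freqs[:-1])  # last row and column
    return true_homozygotes + null_heterozygotes

freqs = [0.5, 0.4, 0.1]                      # hypothetical; null allele at 0.1
j_e = sum(p * p for p in freqs)
direct = apparent_homozygosity(freqs)
shortcut = j_e + 2 * freqs[-1] * (1 - freqs[-1])
print(round(j_e, 2), round(direct, 2), round(shortcut, 2))
```

The direct sum and the shortcut agree because the non-null frequencies add up to 1 - p_n.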
Populations: defining
and identifying
Two major paradigms for defining
populations
• Ecological paradigm
A group of individuals of the same species
that co-occur in space and time and have an
opportunity to interact with each other.
• Evolutionary paradigm
A group of individuals of the same species
living in close enough proximity that any
member of the group can potentially mate
with any other member.
Cocoa from 32 abandoned estates in Trinidad
88 Imperial College Selection (ICS) clones conserved
in the International Cocoa Genebank, Trinidad,
assayed for 35 microsatellite loci
Unweighted pair group
method used to construct
dendrogram of relatedness
between individuals
The different colored groups
can be identified by eye, or
identified with the computer
program “STRUCTURE” (as
was done here).
Yellow perch
The yellow perch (Perca flavescens) is found in the
United States and Canada. It looks similar to the
European perch but is paler. It is in the same family as
the walleye, but in a different family from the white perch.
The yellow perch plays a significant role in the survival and
success of the double-crested cormorant and other birds,
predatory fish, commercial fishermen, and sport fishermen
in the Great Lakes region. This fish must be properly
managed in order to prevent the trophic structure and
economy of the Great Lakes region from collapsing.
mtDNA control region haplotype frequency
patterning for Yellow Perch spawning site groups
across North America
Relationships among mtDNA haplotypes of Yellow Perch
Allele distribution for six representative Yellow Perch microsatellite loci
among selected regions. Rings represent loci, colors within a ring
represent alleles.
Bayesian assignment of Yellow Perch
genetic structure, using STRUCTURE.
Vertical bars represent individuals, colors
within a bar represent probability of
assignment to a cluster. 8 microsatellite
loci, 25 collection sites, N= 495 fish, K=10
Inference of population structure
using multi-locus genotype data
STRUCTURE V2.1
Pritchard, J.K., and Wen, W. (2004)
Pritchard, Stephens, and Donnelly (2000)
Falush, Stephens, and Pritchard (2003)

Main objective of “structure”

Assign individuals to populations on the basis
of their genotypes, while simultaneously
estimating population allele frequencies

Infer number of populations “K” in the process
Other objectives
• Begin with a set of predefined populations
and classify individuals of unknown
origin
• Identify the extent of admixture of
individuals
• Infer the origin of particular loci in the
sampled individuals

STRUCTURE is a Bayesian model-based
method of clustering that makes
many assumptions about
parameters and distributions
Four basic models
1.
Model without admixture
each individual is assumed to originate in
one (only one) of K populations
2.
Model with admixture
each individual is assumed to have inherited
some proportion of its ancestry from each of
K populations
Four basic models
3.
Linkage model
“Chunks” of chromosomes are derived as intact
units from one or another of the K populations, and
all allele copies on the same “chunk” derive
from the same population.
Four basic models
4.
F model
The populations all diverged from a common
ancestral population at the same time, but
the model allows that the populations may have
experienced different amounts of drift since
the divergence event
Assumptions
• The main modeling assumptions are Hardy-Weinberg
equilibrium (HW) within populations
and complete linkage equilibrium (no linkage
disequilibrium, LD) between loci within populations
• The model accounts for departures from HW
or the presence of LD by introducing population structure, and
attempts to find population groupings that (as
far as possible) are not in disequilibrium
Hardy-Weinberg

Gives relationship between gene frequencies and
genotypic frequencies, assuming random mating




$F(AA) = p^2$
$F(Aa) = 2pq$
$F(aa) = q^2$
The extent of a randomly mating population is predicted
by STRUCTURE using HW predictions
Pairwise comparison of LD along chromosomes, high LD is red, low LD is green
Bayesian procedure employed by
STRUCTURE



Step 1: estimate the allele frequencies for each
population assuming that the population of origin
of each individual is known.
Step 2: estimate the population of origin of each
individual, assuming that the population allele
frequencies are known.
Iterate several times using “Markov-Chain
Monte-Carlo” procedure
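
The two alternating steps can be illustrated with a toy sampler. This is a rough sketch, not the actual STRUCTURE implementation: it assumes the no-admixture model, biallelic loci, a uniform prior over populations, and a uniform Beta(1, 1) prior on allele frequencies. All function names are hypothetical, and genotypes are coded as the number of copies (0, 1 or 2) of a reference allele at each locus.

```python
import math
import random

def genotype_loglik(geno, freqs):
    """Log-likelihood of a diploid genotype under HW within one population."""
    total = 0.0
    for copies, p in zip(geno, freqs):
        total += (math.log(math.comb(2, copies))
                  + copies * math.log(p) + (2 - copies) * math.log(1 - p))
    return total

def assignment_probs(geno, pop_freqs):
    """Step 2: probability of each population of origin, frequencies known."""
    logs = [genotype_loglik(geno, f) for f in pop_freqs]
    top = max(logs)                              # subtract max to avoid underflow
    weights = [math.exp(v - top) for v in logs]
    return [w / sum(weights) for w in weights]

def sample_frequencies(genos, labels, k, rng):
    """Step 1: draw allele frequencies for each population, origins known."""
    pop_freqs = []
    for pop in range(k):
        members = [g for g, z in zip(genos, labels) if z == pop]
        freqs = []
        for locus in range(len(genos[0])):
            ref = sum(g[locus] for g in members)
            alt = 2 * len(members) - ref
            p = rng.betavariate(1 + ref, 1 + alt)      # Beta(1, 1) prior
            freqs.append(min(max(p, 1e-9), 1 - 1e-9))  # keep logs finite
        pop_freqs.append(freqs)
    return pop_freqs

def run_sampler(genos, k, n_iter=200, seed=1):
    """Alternate steps 1 and 2, starting from random population labels."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in genos]
    for _ in range(n_iter):
        pop_freqs = sample_frequencies(genos, labels, k, rng)
        labels = [rng.choices(range(k), weights=assignment_probs(g, pop_freqs))[0]
                  for g in genos]
    return labels

# Hypothetical data: four individuals scored at three loci
genos = [[2, 2, 2], [0, 0, 0], [2, 2, 1], [0, 1, 0]]
print(run_sampler(genos, k=2, n_iter=50, seed=3))
```

A real analysis would discard early iterations as burn-in and summarize assignments over many iterations rather than reading off the final labels. The cluster numbering is arbitrary, so only the grouping of individuals is meaningful.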
Good and bad things about
“structure”

When populations are real, most efficient way to
estimate number of populations K and the
membership of individuals to populations

When populations are more continuous (for
example a continuous cline), can impose
incorrect structure on data, and create an
arbitrary number of artificial groups.
Human variation and differentiation
Hundreds of microsatellites now available
Alu markers

Can evolutionary history be reconstructed?
Are there distinct “races”?
Are certain populations less diverse?

K is set to 3
We place individuals in
three groups, without
prior knowledge of
group membership
The more loci, the better the
identification of groups
Noah Rosenberg et al, Science, 2002
• Human Genome Diversity Panel
• 55 Indigenous Populations from 5
Continents: Africa, Americas, Asia,
Europe, Oceania, total of 1,056 people
• 377 microsatellite markers assayed
Structure within structure
Jun Li et al, Science, 2008
• Human Genome Diversity Panel, 938
individuals from 51 populations, 5
continents
• 650,000 SNP markers

Bayesian prior for population assignment
Ursus americanus ssp. kermodei
Purpose of Kermode bear study, conducted in
conjunction with Western Forest Products
• Determine if white bear populations are genetically
unique for other types of genetic variation
• Identify the gene, or genes, that cause the white coat
color difference
• Infer the role of natural selection vs. genetic drift from
patterns of genetic variation for this gene
• Predict effects of forest practices using this information
Populations
sampled for
Kermode bear
hairs
Barbed-wire hair trap with Kermode hair
From 1685 hair samples
to 766 microsatellite profiles
to 216 unique genotypes (22 Kermode)
Kermode-containing populations (yellow):
perhaps 10% less genetic variation, but other
island populations show 10% less variation too
Genetic divergence (below diagonal),
gene flow (above diagonal)
Relationship of populations based upon pairwise
genetic divergence (previous table);
gene frequencies of the white phase given in parentheses
[Figure: white-phase gene frequencies as labeled: (0.00), (0.08), (0.02), (0.05), (.013), (0.05), (0.56), (0.33), (0.04), (0.21), (0.00), (0.10)]
Kermode populations
are not closely related
to each other, some
suggestion of complex
interrelations
[Figure labels: E-Pr, E-Pr, H (Hawkesbury), W-H, P (Pooley Is), R (Roderick Is), T (Terrace/Nass)]