Salmon, Genetics, and Monte Carlo
Eric C. Anderson
Interdisciplinary Program in Quantitative
Ecology and Resource Management
University of Washington, Seattle, WA
Overview
§1 Genetic mixtures
• Introductory genetics
• Latent data formulation
• Gibbs sampler for the posterior distribution
§2 A model for population admixture
§3 Simultaneous mixture/admixture analysis
• Difficult full conditional calculations
• A Baum et al. type of computation
§4 Results
Rapid River Fish Hatchery Example—Genetic Mixture
The Big Question: Can we separate the unmarked hatchery fish from the wild fish using genetic markers?
And also estimate allele frequencies in the two different populations?
• Fish caught at the hatchery trap
are a mixture of wild-spawning and
hatchery fish.
• 95% of hatchery fish are adipose
fin-clipped (i.e., “marked”).
Adipose Fin
Genetics Background
Each salmon cell has about 64 Pairs of Chromosomes
Very precise locations in the genome
may be reliably found and analyzed.
Such a location is called a
LOCUS (plural = LOCI).
Genetic variants at a locus are
known as alleles.
Each individual has two copies of genetic material at a locus which
determine its single-locus genotype.
The probability that an individual carries a particular allele at a
locus depends on how frequent that allele is in the population.
The alleles carried at one locus are typically (and often quite reasonably) assumed to be independent of the alleles carried at other loci.
Wright-Fisher Population Model
[Figure: Wright-Fisher population model with 4 loci (numbered 1 to 4), showing the gamete pool for locus one.]
The probability of the fish’s multilocus phenotype, y, is a simple function
of the population allele frequencies, θ.
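A minimal sketch of how such a probability could be computed, under two assumptions that the slide does not spell out: the markers are codominant, so the observed phenotype at a locus is the genotype itself, and genotypes follow Hardy-Weinberg proportions within a locus. Independence across loci is the assumption stated earlier. The function names and the toy allele frequencies are hypothetical.

```python
from math import prod

def genotype_prob(genotype, freqs):
    """P(single-locus genotype | allele frequencies), assuming
    Hardy-Weinberg proportions (an assumption of this sketch)."""
    a, b = genotype
    if a == b:
        return freqs[a] ** 2          # homozygote
    return 2 * freqs[a] * freqs[b]    # heterozygote, either order

def multilocus_prob(genotypes, theta):
    """Product over loci, assuming independence between loci."""
    return prod(genotype_prob(g, f) for g, f in zip(genotypes, theta))

# Toy allele frequencies for two loci (illustrative numbers only).
theta = [
    {"A": 0.7, "a": 0.3},
    {"B": 0.5, "b": 0.5},
]
y = [("A", "a"), ("B", "B")]
p = multilocus_prob(y, theta)   # 2 * 0.7 * 0.3 * 0.5**2
```

The same function evaluated with θ0 and with θ1 gives the two quantities P(yi|θ0) and P(yi|θ1) that the mixture model needs.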
A Finite Mixture Model
• The marked hatchery fish are treated as a learning sample, size nℓ
• The remaining fish are a sample of size nm from a mixture
– An unknown proportion π0 in the mixture are wild fish
– π1 are unmarked hatchery fish (π0 + π1 = 1)
• Allele frequencies at all the loci available are unknown and denoted by
θ0 and θ1 for the two populations (wild and hatchery), respectively.
• The probability of the phenotype of the ith fish conditional on its being
from Population j is P (yi|θj ), j = 0, 1.
• The likelihood function is
$$
P(y \mid \pi, \theta) = \left( \prod_{i=1}^{n_m} \sum_{j=0}^{1} \pi_j \, P(y_i \mid \theta_j) \right) \cdot \prod_{i=1}^{n_\ell} P(y_i \mid \theta_1)
$$
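As a rough illustration of the likelihood, the sketch below evaluates its logarithm from precomputed genotype probabilities. The function name, its arguments, and the numbers are hypothetical; the per-fish probabilities P(yi|θj) are assumed to have been computed elsewhere.

```python
from math import log

def mixture_log_likelihood(pi0, mix_probs, learn_probs):
    """Log of the mixture likelihood.

    mix_probs:   list of (P(y_i|theta_0), P(y_i|theta_1)) for the mixture fish
    learn_probs: list of P(y_i|theta_1) for the marked hatchery fish
    """
    pi1 = 1.0 - pi0
    # Mixture sample: each fish contributes a weighted sum over origins.
    ll = sum(log(pi0 * p0 + pi1 * p1) for p0, p1 in mix_probs)
    # Learning sample: known hatchery origin, so only theta_1 appears.
    ll += sum(log(p1) for p1 in learn_probs)
    return ll

# Illustrative numbers only, not taken from any data set.
mix_probs = [(0.10, 0.02), (0.01, 0.08)]
learn_probs = [0.05, 0.09]
ll = mixture_log_likelihood(0.5, mix_probs, learn_probs)
```

Working on the log scale avoids underflow when many loci and many fish are multiplied together.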
The Latent Data Perspective
Something is missing here. . . Something that
would make this whole, messy enterprise trivially
easy. . .
[Figure: cartoon of fish with labels "I am a WILD fish", "Hello, My name is George", and "HATCHERY".]
Posterior Probability of Population Origin
We wish to distinguish wild fish from unmarked hatchery fish. This is
typically impossible to do with certainty.
• We may compute the probability of a fish’s origin:
$$
P(\text{fish } i \text{ Wild} \mid y_i, \pi, \theta) = \frac{\pi_0 P(y_i \mid \theta_0)}{\pi_0 P(y_i \mid \theta_0) + \pi_1 P(y_i \mid \theta_1)},
$$
if π and θ are known... But they aren’t!
• We take a Bayesian approach—inference based on the posterior
distribution (Diebolt and Robert 1994):
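• Posterior probability of population origin is
$$
P(\text{fish } i \text{ Wild} \mid y) = \int_{\pi,\theta} P(\text{fish } i \text{ Wild} \mid y_i, \pi, \theta) \, P(\pi, \theta \mid y) \, d\pi \, d\theta
$$
Which is impossible to compute directly. But note:
$$
P(\text{fish } i \text{ Wild} \mid y) = E\left( P(\text{fish } i \text{ Wild} \mid y_i, \pi, \theta) \,\middle|\, y \right)
$$
Which suggests a Monte Carlo estimator
$$
P(\text{fish } i \text{ Wild} \mid y) \approx \frac{1}{M} \sum_{k=1}^{M} P(\text{fish } i \text{ Wild} \mid y_i, \pi^{(k)}, \theta^{(k)})
$$
with (π(k), θ(k)) ∼ P(π, θ|y).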
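The estimator is just an average of the known-parameter formula over posterior draws. A minimal sketch follows; the names are hypothetical, and the three "draws" are made-up numbers standing in for output from a real sampler.

```python
def prob_wild(p0, p1, pi0):
    """P(fish i wild | y_i, pi, theta) when pi and theta are treated as known.
    p0 = P(y_i | theta_0), p1 = P(y_i | theta_1)."""
    num = pi0 * p0
    return num / (num + (1.0 - pi0) * p1)

def mc_posterior_wild(draws):
    """Average prob_wild over M draws.
    draws: list of (pi0, p0, p1), with p_j evaluated at the k-th theta draw."""
    return sum(prob_wild(p0, p1, pi0) for pi0, p0, p1 in draws) / len(draws)

# Placeholder draws for illustration; a real analysis would use MCMC output.
draws = [(0.5, 0.3, 0.1), (0.2, 0.4, 0.1), (0.5, 0.1, 0.1)]
estimate = mc_posterior_wild(draws)   # (0.75 + 0.5 + 0.5) / 3
```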
• But, we must simulate each (π(k), θ(k)) from the posterior distribution.
$$
P(\pi, \theta \mid y) = \frac{1}{C} \, \rho(\pi) \, \rho(\theta) \, P(y \mid \pi, \theta)
$$
• The prior ρ(π) is a Beta distribution and the priors for the allele
frequencies at each locus are Dirichlet distributions.
– Often taken to be uniform
• C is typically unknown. But MCMC is an option.
Latent Data Revisited
Let the ith fish’s flag be zi = 0 for wild origin or zi = 1 for hatchery origin
• Use Gibbs sampling to simulate values (π(k), θ(k), z(k)), k = 1, . . . , M,
then use the values of π(k) and θ(k) that come out.
• Consecutively simulate new values from their full conditional distributions:
– Update all the zi’s (i = 1, . . . , nm).
– Update π
– Update θ
• The full conditionals are simple distributions because of conjugacy and
the use of latent variables.
P (zi = 0| · · ·) = P (zi = 0|yi, π, θ) = P (fish i Wild|yi, π, θ)
(π0| · · ·) = (π0|z) ∼ Beta(1 + #{z = 0}, 1 + #{z = 1})
• θ0 and θ1 have full conditional distributions which are Dirichlet
distributions arising from standard multinomial sampling of allelic types.
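The sweep described above can be sketched in plain Python. This is an illustrative toy, not the author's software: it assumes uniform priors, Hardy-Weinberg genotype probabilities, and genotypes coded as pairs of allele indices; all function names and the toy data are hypothetical. It also omits burn-in for brevity.

```python
import random

def dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution via normalized Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(g)
    return [x / total for x in g]

def geno_prob(fish, theta):
    """P(y_i | theta_j): Hardy-Weinberg within loci, independent loci (assumed)."""
    p = 1.0
    for (a, b), freqs in zip(fish, theta):
        p *= freqs[a] ** 2 if a == b else 2.0 * freqs[a] * freqs[b]
    return p

def allele_counts(fish_list, n_alleles):
    counts = [[0] * k for k in n_alleles]
    for fish in fish_list:
        for locus, (a, b) in enumerate(fish):
            counts[locus][a] += 1
            counts[locus][b] += 1
    return counts

def draw_theta(fish_list, n_alleles, rng):
    """Full conditional for one population: Dirichlet(1 + allele counts) per locus."""
    return [dirichlet([1 + c for c in cs], rng)
            for cs in allele_counts(fish_list, n_alleles)]

def gibbs_mixture(mix, learn, n_alleles, sweeps, seed=0):
    rng = random.Random(seed)
    pi0 = 0.5  # arbitrary starting value
    theta = [draw_theta([], n_alleles, rng), draw_theta(learn, n_alleles, rng)]
    z = [0] * len(mix)
    p_wild_sum = [0.0] * len(mix)
    pi0_draws = []
    for _ in range(sweeps):
        # 1. Update every z_i from P(z_i = 0 | y_i, pi, theta).
        for i, fish in enumerate(mix):
            w0 = pi0 * geno_prob(fish, theta[0])
            w1 = (1.0 - pi0) * geno_prob(fish, theta[1])
            p_wild = w0 / (w0 + w1)
            p_wild_sum[i] += p_wild
            z[i] = 0 if rng.random() < p_wild else 1
        # 2. Update pi_0 ~ Beta(1 + #{z = 0}, 1 + #{z = 1}).
        n0 = z.count(0)
        pi0 = rng.betavariate(1 + n0, 1 + len(z) - n0)
        pi0_draws.append(pi0)
        # 3. Update theta_0 (wild) and theta_1 (hatchery plus learning sample).
        wild = [f for f, zi in zip(mix, z) if zi == 0]
        hatch = [f for f, zi in zip(mix, z) if zi == 1] + learn
        theta = [draw_theta(wild, n_alleles, rng), draw_theta(hatch, n_alleles, rng)]
    # Average of P(fish i wild | y_i, pi, theta) over sweeps, plus the pi_0 draws.
    return [s / sweeps for s in p_wild_sum], pi0_draws

# Toy data: 3 loci with 2 alleles each (made up for illustration).
mix = [[(0, 0), (0, 0), (0, 0)], [(0, 0), (0, 1), (0, 0)],
       [(1, 1), (1, 1), (1, 1)], [(1, 1), (0, 1), (1, 1)]]
learn = [[(1, 1), (1, 1), (1, 1)], [(1, 1), (1, 1), (0, 1)], [(1, 1), (0, 1), (1, 1)]]
post_wild, pi0_draws = gibbs_mixture(mix, learn, [2, 2, 2], sweeps=200, seed=1)
```

Averaging the conditional probability rather than the sampled zi values matches the Monte Carlo estimator given earlier; with so few fish and loci the toy output is only a smoke test, not an inference.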
What This Does/Doesn’t Offer You
Subject to the assumptions made in choosing the prior distributions for π
and θ this gets you:
• Estimates of the posterior probability of population of origin of each fish
in the mixture.
• Estimates of the posterior distribution of allele frequencies of the wild
and hatchery populations
• An estimate of the posterior distribution of π
However, the model is rather inflexible in assuming that every individual is
purely descended from either the wild population or the hatchery population.
A Model For Admixed Populations
(Pritchard et al., in press)
For the ith fish: q0i ∼ Beta(α, α), with q0i + q1i = 1
[Figure: the ith fish draws its genes in proportions q0i and q1i from the wild "gene pool" and the hatchery "gene pool".]
• The goal is to estimate q0i for the ith fish (i = 1, . . . , nm),
the posterior probability of population of origin for each gene
copy, and θ.
Latent Data, w, for the Admixture Model
• The tth gene copy in the ith fish gets wit = 0 or 1, (t = 1, . . . , G)
Knowing w and q renders simple the calculation of many quantities,
particularly the full conditionals.
[Figure: a fish's gene copies, each labeled 0 or 1.]
Gibbs Sampling in the Admixture Model
• Perform a Gibbs sampling sweep as follows, updating from the full
conditional distributions:
– Update all the q0i’s, (i = 1, . . . , nm)
– Update θ
– Update all the wit’s , (i = 1, . . . , nm ; t = 1, . . . , G)
• (q0i| · · ·) = (q0i|wi, α) ∼ Beta(α + #{wi = 0}, α + #{wi = 1})
• Allele frequency full conditionals remain simple Dirichlet distributions
• Furthermore for the ith fish, knowing q0i makes it easy to compute the
full conditional for each wit:
$$
P(w_{it} = 0 \mid \cdots) = P(w_{it} = 0 \mid y, \theta, q) = \frac{q_{0i} P(y_{it} \mid \theta_0)}{q_{0i} P(y_{it} \mid \theta_0) + q_{1i} P(y_{it} \mid \theta_1)}
$$
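For a single fish, the w and q0i updates can be sketched as below. The function name and inputs are hypothetical: the per-gene-copy probabilities P(yit|θ0) and P(yit|θ1) are assumed to be supplied, and the θ update is left out. The sketch updates w first and then q0i, which is the same cycle as the sweep above started at a different point.

```python
import random

def update_fish(probs0, probs1, q0, alpha, rng):
    """One Gibbs update of (w_i, q_0i) for a single fish.

    probs0[t] = P(y_it | theta_0), probs1[t] = P(y_it | theta_1)
    """
    w = []
    for p0, p1 in zip(probs0, probs1):
        num = q0 * p0
        prob_w0 = num / (num + (1.0 - q0) * p1)   # P(w_it = 0 | ...)
        w.append(0 if rng.random() < prob_w0 else 1)
    n0 = w.count(0)
    n1 = len(w) - n0
    # q_0i | w_i, alpha ~ Beta(alpha + #{w_i = 0}, alpha + #{w_i = 1})
    q0_new = rng.betavariate(alpha + n0, alpha + n1)
    return w, q0_new

# Illustrative numbers only.
rng = random.Random(42)
w, q0 = update_fish([0.6, 0.1, 0.5, 0.2], [0.1, 0.7, 0.5, 0.3],
                    q0=0.5, alpha=1.0, rng=rng)
```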
What The Admixture Model Delivers
Subject to the assumptions made in choosing the prior distributions for θ
and q (recall that the prior on q is determined by α) this gets you:
• Estimates of the posterior probability of population of origin of each copy
of each gene within each fish in the sample of unmarked fish.
• Estimates of the posterior distribution of allele frequencies of the wild
and hatchery populations
• An estimate of the posterior distribution of q0i for the ith fish
This is great for populations which have been interbreeding for a long time.
However, this is not entirely appropriate if many individuals in the mixture
are of pure descent (i.e., non-hybrids).
Merging the Mixture and Admixture Models
Clearly we desire a probability model in which each fish may be either
“pure” or “admixed”.
Conceptually this is easy. Declare:
• Each fish is of pure origin with unknown probability ξP or admixed with
unknown probability ξA (ξP + ξA = 1).
The probability of the ith fish’s genotype is a weighted average of its
probability under the mixture and admixture models.
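Written out, the weighted average would take roughly the following form. The conditioning symbols are borrowed from the later slides, where vi ∈ {P, A} denotes the type of descent of the ith fish; this is a restatement of the sentence above rather than a formula quoted from the source.

```latex
P(y_i \mid \xi, \pi, \theta, \alpha)
  = \xi_P \, P(y_i \mid \pi, \theta, v_i = P)
  + \xi_A \, P(y_i \mid \theta, \alpha, v_i = A),
\qquad \xi_P + \xi_A = 1 .
```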
• Simulation from the posterior distribution is again possible with Gibbs
sampling
• Requires one final piece of latent data, vi, for the ith fish.
Gibbs Sampling in the Merged Model
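Pure descent (vi = P): the additional latent datum is zi
Admixed descent (vi = A): the additional latent data are the wit’s and q0i
$$
P(v_i = P \mid \cdots) = \frac{\xi_P P(y_i \mid v_i = P, \cdots)}{\xi_P P(y_i \mid v_i = P, \cdots) + \xi_A P(y_i \mid v_i = A, \cdots)}
$$
[This update is marked with an X on the slide.]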
Updating vi Correctly
Imagine a model without the latent data. Then:
$$
P(v_i = A \mid y_i, \pi, \theta, \alpha, \xi) = \frac{\xi_A P(y_i \mid \theta, \alpha, v_i = A)}{\xi_A P(y_i \mid \theta, \alpha, v_i = A) + \xi_P P(y_i \mid \pi, \theta, v_i = P)}
$$
With latent data, we may update vi and the “additional latent data” (ALD)
jointly from the joint full conditional by the following factorization:
P (vi, ALD| · · ·) = P (vi|yi, π, θ, α, ξ)P (ALD|yi, π, θ, α, ξ, vi)
The main intricacy lies in computing P(yi|π, θ, α, vi), which is a marginal
distribution obtained by integrating and/or summing out the additional
latent data (either q0i and the wit’s or zi).
Integrating Out the Latent Data
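• Easy for vi = P
$$
P(y_i \mid \pi, \theta, v_i = P) = \sum_{z_i \in \{0,1\}} P(y_i, z_i \mid \pi, \theta, v_i = P) = \sum_{j=0}^{1} P(y_i \mid \theta, v_i = P, z_i = j) \, P(z_i = j \mid \pi)
$$
• More difficult for vi = A
$$
P(y_i \mid \theta, \alpha, v_i = A) = \int_0^1 \sum_{w_i} P(y_i \mid \theta, w_i) \, P(w_i \mid q_{0i}) \, P(q_{0i} \mid \alpha) \, dq_{0i}
$$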
But there is a neat way to compute this using some facts about the
beta-binomial distribution, Pólya-Eggenberger urn schemes, and hidden
Markov chains.
An Urn Scheme Interpretation
• First imagine that the genotypes are not known, and consider the
distribution of the wit’s given only α.
• Recall that wit|q0i ∼ Bernoulli(q0i) and that q0i ∼ Beta(α, α) so
$$
P(w_i \mid \alpha) \propto \int_0^1 q_{0i}^{\alpha-1} (1 - q_{0i})^{\alpha-1} \prod_{t=1}^{G} q_{0i}^{w_{it}} (1 - q_{0i})^{1 - w_{it}} \, dq_{0i}
$$
• For illustration, take α = 1
t:                    1    2    3    4    5    6    7    8
wit:                  0    1    1    0    0    1    1    1
P(wit | earlier w):   1/2  1/3  2/4  2/5  3/6  3/7  4/8  5/9
dt:                   1/2  1/3  2/4  3/5  3/6  3/7  4/8  5/9
P(wi | α = 1) = 1/2 × 1/3 × 2/4 × 2/5 × 3/6 × 3/7 × 4/8 × 5/9
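The α = 1 illustration can be checked with exact fractions. In this sketch the urn starts with α balls of each type and gains one ball of the drawn type after every draw; dt is read as the chance that draw t is a 1, which is an interpretation inferred from the slide's numbers rather than a definition given in the text. The function name is hypothetical.

```python
from fractions import Fraction

def urn_sequence(w, alpha=1):
    """Return (draw_probs, d) for a 0/1 sequence w under a Polya urn
    that starts with alpha balls of each type."""
    ones = Fraction(alpha)        # current weight of type-1 balls
    total = Fraction(2 * alpha)   # current total weight in the urn
    draw_probs, d = [], []
    for wt in w:
        d_t = ones / total        # chance that this draw is a 1
        d.append(d_t)
        draw_probs.append(d_t if wt == 1 else 1 - d_t)
        ones += wt                # add a ball of the drawn type
        total += 1
    return draw_probs, d

w = [0, 1, 1, 0, 0, 1, 1, 1]      # the sequence used in the illustration
draw_probs, d = urn_sequence(w)
joint = Fraction(1)
for p in draw_probs:
    joint *= p                    # P(w | alpha = 1)
```

For this sequence the product is 1/504, which agrees with the beta-binomial integral of q^5 (1 − q)^3 over [0, 1], namely 5! 3! / 9!.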
A Markov Chain Interpretation
• Sampling wit’s from this urn scheme defines a time inhomogeneous
Markov chain on the pairs (wit, dt)
(wi1, d1) −→ (wi2, d2) −→ (wi3, d3) −→ · · · −→ (wi7, d7) −→ (wi8, d8)
• Ultimately, though, we must include the data yi.
• This gives us a hidden Markov chain
(wi1, d1) −→ (wi2, d2) −→ (wi3, d3) −→ · · · −→ (wi7, d7) −→ (wi8, d8)
    ↓            ↓            ↓                    ↓            ↓
   yi1          yi2          yi3                  yi7          yi8
• The Baum et al. family of algorithms provides an efficient way to sum
out the wit’s, giving us P (yi|θ, α, vi = A), as desired.
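A forward pass in the spirit of that family of algorithms might look like the sketch below; it is not claimed to be the author's implementation. It assumes that dt is determined by the number of 1s drawn so far, so the hidden state can be tracked by that count, and that the per-gene-copy probabilities P(yit|θ0) and P(yit|θ1) are supplied as inputs. A brute-force sum over all 2^G labelings is included only as a check.

```python
from itertools import product
from math import exp, lgamma

def forward_marginal(e0, e1, alpha):
    """Sum over all w of P(y | w) P(w | alpha), tracking only the count of 1s.
    e0[t] = P(y_it | theta_0), e1[t] = P(y_it | theta_1)."""
    f = {0: 1.0}   # f[c] = P(y_1..t, c ones among w_1..t)
    for t, (p0, p1) in enumerate(zip(e0, e1)):
        nxt = {}
        for c, mass in f.items():
            d = (alpha + c) / (2 * alpha + t)   # urn chance that the next w is 1
            nxt[c + 1] = nxt.get(c + 1, 0.0) + mass * d * p1
            nxt[c] = nxt.get(c, 0.0) + mass * (1 - d) * p0
        f = nxt
    return sum(f.values())

def brute_force_marginal(e0, e1, alpha):
    """Same quantity by enumerating every w, using the beta-binomial form of P(w)."""
    def log_beta(a, b):
        return lgamma(a) + lgamma(b) - lgamma(a + b)
    total = 0.0
    for w in product((0, 1), repeat=len(e0)):
        n1 = sum(w)
        n0 = len(w) - n1
        p_w = exp(log_beta(alpha + n1, alpha + n0) - log_beta(alpha, alpha))
        p_y = 1.0
        for wt, p0, p1 in zip(w, e0, e1):
            p_y *= p1 if wt else p0
        total += p_w * p_y
    return total

# Illustrative emission probabilities for G = 4 gene copies.
e0 = [0.6, 0.1, 0.5, 0.2]
e1 = [0.1, 0.7, 0.5, 0.3]
fast = forward_marginal(e0, e1, alpha=1.0)
slow = brute_force_marginal(e0, e1, alpha=1.0)
```

The forward pass visits on the order of G² states instead of 2^G labelings, which is what makes the vi update practical when G is large.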
Results
• Dataset kindly provided by Paul Moran (Nat’l Marine Fisheries Service):
– 49 Marked hatchery fish (learning sample)
– 49 Unmarked fish (sample from the mixture)
– 20 loci (with between 2 and 16 alleles)
• Found no strong evidence for the existence of two separate populations
in the mixture. Either:
– Putative wild population has allele frequencies similar to hatchery’s
– Putative wild population no longer exists
• Fortunately the methods/software are applicable to situations in many
different species
Scottish Wildcat Data and Analysis
• Data kindly provided to me by Mark Beaumont (Reading, UK)
– 239 “Wild-living” cats
– 65 Domestic cats collected at veterinary centers
– 9 Moderately polymorphic microsatellite loci
• Analysis
– Assumed we had no learning sample
– Uniform priors on ξ, π, and q (i.e., α = 1)
– 10,000 Sweeps of burn-in
– 100,000 Sweeps for inference (took about 2 hours)
Mixing and Inference for ξ
[Figure: trace plot of the realized value of ξP (0 to 1) against sweep number (-10000 to 20000).]
[Figure: posterior probability (0 to 0.07) plotted against ξA and ξP (0 to 1).]
Individual Posterior Probabilities of Type of Descent
[Figure: P(vi = P|y), shown as P(Pure | y) with 90% credible set, for the 304 cats ordered by P(Pure | y).]
P(vi = P|y) vs. P(zi = 0|y, vi = P)
[Figure: scatter plot of P(From Source 0 | PURE, y) against P(PURE | y) for 304 cats, with separate symbols for cats collected wild and cats collected at vets.]
P(vi = P|y) vs. P(zi = 0|y, vi = P) (Zoomed In and Jiggled)
[Figure: scatter plot of P(From Source 0 | PURE, y) against P(PURE | y), restricted to P(PURE | y) between 0.85 and 0.97, with separate symbols for cats collected wild and cats collected at vets.]
Posterior Probability of First-Generation Hybrids
[Figure: P(F1 Hybrid | y) (0 to 0.4) plotted against the average of q0 given admixed descent and y (0 to 1).]
MCMC and Gibbs Sampling
• MCMC is just like standard Monte Carlo except that the simulated values
are states visited by a Markov chain.
• Requires knowing the target distribution “up to a normalizing constant”
• Gibbs Sampling: constructing multi-dimensional Markov chains
– Successively update individual components/variables of a probability
model conditional on the current state of all remaining variables in the
model.
– Full conditional distributions: P (X| · · ·)
– Works best when the full conditionals are quickly computed.
Summary
• Models of genetic mixture and genetic admixture
• Model for “simultaneous” mixture and admixture
• Bayesian inference via MCMC (Gibbs sampling)
• Enjoyable full conditional calculations for the “merged model”
• Salmon data: a fine demonstration that there must be genetic differences
between the populations in question if one wishes to separate individuals.
• Scottish wildcat data: admirable performance of the method.
– Extension to more specific models of admixture.