Maximum-likelihood and Bayesian parameter estimation
Andrea Passerini
[email protected]
Machine Learning
Parameter estimation
Setting
Data are sampled from a probability distribution p(x, y)
The form of the probability distribution p is known but its
parameters are unknown
There is a training set D = {(x1 , y1 ), . . . , (xm , ym )} of
examples sampled i.i.d. according to p(x, y)
Task
Estimate the unknown parameters of p from training data D.
Note: i.i.d. sampling
independent: each example is sampled independently
from the others
identically distributed: all examples are sampled from the
same distribution
Parameter estimation
Multiclass classification setting
The training set can be divided into subsets D_1, ..., D_c, one for each class (D_i = {x_1, ..., x_n} contains i.i.d. examples for target class y_i)
For any new example x (not in training set), we compute
the posterior probability of the class given the example and
the full training set D:
P(y_i | x, D) = p(x | y_i, D) p(y_i | D) / p(x | D)
Note
same as Bayesian decision theory (compute posterior
probability of class given example)
except that parameters of distributions are unknown
a training set D is provided instead
Parameter estimation
Multiclass classification setting: simplifications
P(y_i | x, D) = p(x | y_i, D_i) p(y_i | D) / p(x | D)
we assume x is independent of D_j (j ≠ i) given y_i and D_i
without additional knowledge, p(yi |D) can be computed as
the fraction of examples with that class in the dataset
the normalizing factor p(x|D) can be computed by marginalizing p(x|y_i, D_i) p(y_i|D) over the possible classes
Note
We must estimate the class-dependent parameters θ_i for p(x | y_i, D_i)
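As a concrete illustration of this decomposition, here is a minimal Python sketch (with a hypothetical one-dimensional dataset and Gaussian class-conditional densities, both assumptions made only for illustration) that computes P(y_i | x, D) from per-class densities, class fractions, and the marginalization over classes:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical per-class training subsets D_i (1-D examples for illustration).
D = {0: np.array([1.1, 0.9, 1.3, 0.8]),
     1: np.array([3.0, 3.2, 2.7])}

# Class priors p(y_i|D) as fractions of examples per class.
n_total = sum(len(Di) for Di in D.values())
prior = {c: len(Di) / n_total for c, Di in D.items()}

# Class-conditional densities p(x|y_i, D_i): here Gaussians whose
# parameters are estimated from D_i (maximum likelihood, see later slides).
def class_density(x, Di):
    mu, sigma = Di.mean(), Di.std()          # ML estimates
    return norm.pdf(x, loc=mu, scale=sigma)

def posterior(x):
    # Unnormalized posteriors p(x|y_i, D_i) p(y_i|D) ...
    unnorm = {c: class_density(x, Di) * prior[c] for c, Di in D.items()}
    # ... normalized by p(x|D), obtained by marginalizing over classes.
    evidence = sum(unnorm.values())
    return {c: v / evidence for c, v in unnorm.items()}

print(posterior(1.0))   # most mass on class 0
print(posterior(2.9))   # most mass on class 1
```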
Maximum Likelihood vs Bayesian estimation
Maximum likelihood / maximum a-posteriori estimation
Assumes parameters θ i have fixed but unknown values
Values are computed as those maximizing the probability
of the observed examples Di (the training set for the class)
Obtained values are used to compute probability for new
examples:
p(x | y_i, D_i) ≈ p(x | θ_i)
Maximum Likelihood vs Bayesian estimation
Bayesian estimation
Assumes parameters θ i are random variables with some
known prior distribution
Observing examples turns prior distribution over
parameters into a posterior distribution
Predictions for new examples are obtained integrating over
all possible values for the parameters:
p(x | y_i, D_i) = ∫_{θ_i} p(x, θ_i | y_i, D_i) dθ_i
Maximum likelihood / maximum a-posteriori estimation
Maximum a-posteriori estimation
θ_i* = argmax_{θ_i} p(θ_i | D_i, y_i) = argmax_{θ_i} p(D_i, y_i | θ_i) p(θ_i)
Assumes a prior distribution for the parameters p(θ i ) is
available
Maximum likelihood estimation (most common)
θ_i* = argmax_{θ_i} p(D_i, y_i | θ_i)
maximizes the likelihood of the parameters with respect to
the training samples
no assumption about prior distributions for parameters
Note
Each class y_i is treated independently: for simplicity, in the following we write D in place of y_i, D_i
Maximum-likelihood (ML) estimation
Setting (again)
A training set D = {x_1, ..., x_n} of i.i.d. examples for the target class y is available
We assume the parameter vector θ has a fixed but unknown value
We estimate this value by maximizing its likelihood with respect to the training data:
θ* = argmax_θ p(D | θ) = argmax_θ ∏_{j=1}^n p(x_j | θ)
The joint probability over D decomposes into a product as examples are i.i.d. (thus independent of each other given the distribution)
Maximum-likelihood estimation
Maximizing log-likelihood
It is usually simpler to maximize the logarithm of the
likelihood (monotonic):
θ* = argmax_θ ln p(D | θ) = argmax_θ ∑_{j=1}^n ln p(x_j | θ)
Necessary conditions for the maximum are obtained by setting the gradient with respect to θ to zero:
∇_θ ∑_{j=1}^n ln p(x_j | θ) = 0
Points zeroing the gradient can be local or global maxima
depending on the form of the distribution
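The same condition can also be handled numerically. A minimal sketch (assuming a univariate Gaussian and synthetic data, both chosen only for illustration) that maximizes the log-likelihood with a generic optimizer instead of solving the gradient equation analytically; the result matches the closed-form estimates derived in the next slides:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # i.i.d. sample, true mu=2, sigma=1.5

def neg_log_likelihood(params):
    mu, log_sigma = params                     # log-parametrization keeps sigma > 0
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# The numerical optimum matches the closed-form ML estimates.
print(mu_hat, x.mean())
print(sigma_hat, x.std())                      # np.std uses 1/n, i.e. the ML estimate
```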
Maximum-likelihood estimation
Gaussian case: unknown µ, known Σ
The log-likelihood is:
∑_{j=1}^n ln p(x_j | θ) = ∑_{j=1}^n [ −(1/2) (x_j − µ)^t Σ^{-1} (x_j − µ) − (1/2) ln((2π)^d |Σ|) ]
The gradient with respect to the mean is:
∇_µ ∑_{j=1}^n ln p(x_j | θ) = ∑_{j=1}^n Σ^{-1} (x_j − µ)
Setting the gradient to zero gives:
∑_{j=1}^n Σ^{-1} (x_j − µ*) = 0   ⇒   µ* = (1/n) ∑_{j=1}^n x_j
Maximum-likelihood estimation
Univariate Gaussian case: unknown µ and σ²
The log-likelihood is:
L = ∑_{j=1}^n [ −(1/(2σ²)) (x_j − µ)² − (1/2) ln(2πσ²) ]
The gradient is:
∇_{µ,σ²} L = [ ∑_{j=1}^n (1/σ²)(x_j − µ) ,  ∑_{j=1}^n ( −1/(2σ²) + (x_j − µ)²/(2σ⁴) ) ]
Maximum-likelihood estimation
Univariate Gaussian case: unknown µ and σ²
Setting the gradient to zero gives the mean:
µ* = (1/n) ∑_{j=1}^n x_j
and the variance:
∑_{j=1}^n 1/(2σ²) = ∑_{j=1}^n (x_j − µ*)²/(2σ⁴)
∑_{j=1}^n σ² = ∑_{j=1}^n (x_j − µ*)²
σ² = (1/n) ∑_{j=1}^n (x_j − µ*)²
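A minimal numpy sketch of these closed-form estimates (the data values are hypothetical); note that the ML variance uses the 1/n normalization:

```python
import numpy as np

x = np.array([2.1, 1.7, 2.4, 1.9, 2.6, 2.0])   # i.i.d. sample for one class

mu_ml = x.mean()                               # mu* = (1/n) sum_j x_j
var_ml = ((x - mu_ml) ** 2).mean()             # sigma^2 = (1/n) sum_j (x_j - mu*)^2

# This is the 1/n (ML) estimate; np.var(x) also uses 1/n by default,
# while np.var(x, ddof=1) would give the unbiased 1/(n-1) version.
assert np.isclose(var_ml, np.var(x))
print(mu_ml, var_ml)
```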
Maximum-likelihood estimation
Multivariate Gaussian case: unknown µ and Σ
The log-likelihood is:
∑_{j=1}^n [ −(1/2) (x_j − µ)^t Σ^{-1} (x_j − µ) − (1/2) ln((2π)^d |Σ|) ]
The maximum-likelihood estimates are:
µ* = (1/n) ∑_{j=1}^n x_j
and:
Σ* = (1/n) ∑_{j=1}^n (x_j − µ*)(x_j − µ*)^t
Maximum-likelihood estimation
General Gaussian case
Maximum likelihood estimates for Gaussian parameters
are simply their empirical estimates over the samples:
Gaussian mean is the sample mean
Gaussian covariance matrix is the mean of the sample
covariances
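A minimal numpy sketch of these empirical estimates for the multivariate case (the data matrix is hypothetical):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [0.8, 2.4],
              [1.2, 2.1]])                    # n x d matrix of i.i.d. samples

n = X.shape[0]
mu_ml = X.mean(axis=0)                        # sample mean (d-dimensional)

Xc = X - mu_ml
Sigma_ml = (Xc.T @ Xc) / n                    # mean of sample covariances (1/n, not 1/(n-1))

# np.cov normalizes by 1/(n-1) unless bias=True is passed.
assert np.allclose(Sigma_ml, np.cov(X, rowvar=False, bias=True))
print(mu_ml, Sigma_ml, sep="\n")
```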
Bayesian estimation
Setting (again)
Assumes parameters θ_i are random variables with some known prior distribution
Predictions for new examples are obtained integrating over all possible values for the parameters:
p(x | y_i, D_i) = ∫_{θ_i} p(x, θ_i | y_i, D_i) dθ_i
Since the probability of x given each class y_i is independent of the other classes y_j, for simplicity we can again write:
p(x | y_i, D_i) → p(x | D) = ∫_θ p(x, θ | D) dθ
where D is the dataset for a certain class y and θ are the parameters of its distribution
Bayesian estimation
Setting
p(x | D) = ∫_θ p(x, θ | D) dθ = ∫_θ p(x | θ) p(θ | D) dθ
p(x | θ) can be computed easily (we know both the form and the parameters of the distribution, e.g. Gaussian)
we need to estimate the parameter posterior density given the training set:
p(θ | D) = p(D | θ) p(θ) / p(D)
Bayesian estimation
Denominator
p(θ | D) = p(D | θ) p(θ) / p(D)
p(D) is a constant independent of θ (i.e. it will not influence the final Bayesian decision)
if the final probability (not only the decision) is needed, we can compute it as:
p(D) = ∫_θ p(D | θ) p(θ) dθ
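When no closed form is available, these quantities can be approximated numerically. A minimal sketch (assuming a one-dimensional parameter, a Gaussian likelihood with known σ, and a Gaussian prior, all chosen only for illustration) that evaluates p(D|θ)p(θ) on a grid and normalizes by a numerical approximation of p(D):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D example: unknown Gaussian mean theta, known sigma,
# Gaussian prior over theta (the case treated analytically below).
x = np.array([0.8, 1.3, 1.1, 0.6, 1.4])       # dataset D for one class
sigma, mu0, sigma0 = 1.0, 0.0, 2.0            # known std dev and prior parameters

theta = np.linspace(-5, 5, 2001)              # grid over parameter values
likelihood = np.prod(norm.pdf(x[:, None], loc=theta, scale=sigma), axis=0)  # p(D|theta)
prior = norm.pdf(theta, loc=mu0, scale=sigma0)                              # p(theta)

unnorm = likelihood * prior
p_D = np.sum(unnorm) * (theta[1] - theta[0])  # p(D): numerical approximation of the integral
posterior = unnorm / p_D                      # p(theta|D) on the grid

print(theta[np.argmax(posterior)])            # posterior mode (close to the sample mean here)
```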
Bayesian estimation
Univariate normal case: unknown µ, known σ²
Examples are drawn from:
p(x | µ) ∼ N(µ, σ²)
The Gaussian mean prior distribution is itself normal:
p(µ) ∼ N(µ_0, σ_0²)
The Gaussian mean posterior given the dataset is computed as:
p(µ | D) = p(D | µ) p(µ) / p(D) = α ∏_{j=1}^n p(x_j | µ) p(µ)
where α = 1/p(D) is independent of µ
Univariate normal case: unknown µ, known σ²
A posteriori parameter density
p(µ | D) = α ∏_{j=1}^n p(x_j | µ) p(µ)
         = α { ∏_{j=1}^n 1/(√(2π) σ) exp[ −(1/2) ((x_j − µ)/σ)² ] } · 1/(√(2π) σ_0) exp[ −(1/2) ((µ − µ_0)/σ_0)² ]
         = α' exp[ −(1/2) ( ∑_{j=1}^n ((µ − x_j)/σ)² + ((µ − µ_0)/σ_0)² ) ]
         = α'' exp[ −(1/2) ( (n/σ² + 1/σ_0²) µ² − 2 ( (1/σ²) ∑_{j=1}^n x_j + µ_0/σ_0² ) µ ) ]
Being the exponential of a quadratic function of µ, p(µ | D) is again a normal distribution:
p(µ | D) = 1/(√(2π) σ_n) exp[ −(1/2) ((µ − µ_n)/σ_n)² ]
Univariate normal case: unknown µ, known σ²
Recovering mean and variance
Matching the coefficients of the two quadratic forms in µ:
(n/σ² + 1/σ_0²) µ² − 2 ( (1/σ²) ∑_{j=1}^n x_j + µ_0/σ_0² ) µ + α''' = (1/σ_n²) µ² − 2 (µ_n/σ_n²) µ + µ_n²/σ_n²
Solving for µ_n and σ_n² we obtain:
µ_n = ( n σ_0² / (n σ_0² + σ²) ) µ̂_n + ( σ² / (n σ_0² + σ²) ) µ_0
σ_n² = σ_0² σ² / (n σ_0² + σ²)
where µ̂_n is the sample mean:
µ̂_n = (1/n) ∑_{j=1}^n x_j
Univariate normal case: unknown µ, known σ²
Interpreting the posterior
µ_n = ( n σ_0² / (n σ_0² + σ²) ) µ̂_n + ( σ² / (n σ_0² + σ²) ) µ_0
σ_n² = σ_0² σ² / (n σ_0² + σ²)
The mean µ_n is a linear combination of the prior mean (µ_0) and the sample mean (µ̂_n)
The more training examples (n) are seen, the more the sample mean dominates over the prior mean (unless σ_0² = 0)
The more training examples (n) are seen, the more the variance decreases, making the distribution sharply peaked over its mean:
lim_{n→∞} σ_0² σ² / (n σ_0² + σ²) = lim_{n→∞} σ²/n = 0
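A minimal sketch (with synthetic data and an assumed prior) that evaluates µ_n and σ_n² for growing n, showing the posterior mean moving from the prior mean toward the sample mean and the posterior variance shrinking:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, mu0, sigma0 = 1.0, 0.0, 1.0            # known std dev and assumed prior parameters
x = rng.normal(loc=2.0, scale=sigma, size=100)

for n in (1, 5, 20, 100):
    sample_mean = x[:n].mean()
    mu_n = (n * sigma0**2 / (n * sigma0**2 + sigma**2)) * sample_mean \
         + (sigma**2 / (n * sigma0**2 + sigma**2)) * mu0
    var_n = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
    print(n, round(mu_n, 3), round(var_n, 4))
# mu_n moves from the prior mean toward the sample mean; var_n -> 0 as n grows.
```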
Univariate normal case: unknown µ, known σ²
Mean posterior distribution varying sample size
[Figure: plots of the posterior density p(µ | x_1, ..., x_n) for increasing numbers of training examples n; as n grows the posterior becomes increasingly sharply peaked around the sample mean]
Conjugate priors
Definition
Given a likelihood function p(x|θ)
Given a prior distribution p(θ)
p(θ) is a conjugate prior for p(x|θ) if the posterior
distribution p(θ|x) is in the same family as the prior p(θ)
Examples

Likelihood             Parameters               Conjugate prior
Binomial               p (probability)          Beta
Multinomial            p (probability vector)   Dirichlet
Normal                 µ (mean)                 Normal
Multivariate normal    µ (mean vector)          Normal
Univariate normal case: unknown µ, known σ²
Computing the class-conditional density
p(x | D) = ∫ p(x | µ) p(µ | D) dµ
         = ∫ 1/(√(2π) σ) exp[ −(1/2) ((x − µ)/σ)² ] · 1/(√(2π) σ_n) exp[ −(1/2) ((µ − µ_n)/σ_n)² ] dµ
         ∼ N(µ_n, σ² + σ_n²)
Note (proof omitted)
the probability of x given the dataset for the class is a
Gaussian with:
mean equal to the posterior mean
variance equal to the sum of the known variance (σ²) and an additional variance (σ_n²) due to the uncertainty on the mean
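A minimal sketch (with hypothetical data and prior parameters) that computes the posterior parameters and evaluates the resulting predictive density N(µ_n, σ² + σ_n²):

```python
import numpy as np
from scipy.stats import norm

x_train = np.array([1.9, 2.3, 2.0, 1.7, 2.6])   # dataset D for the class
sigma, mu0, sigma0 = 0.5, 0.0, 1.0              # known std dev and assumed prior parameters

n = len(x_train)
sample_mean = x_train.mean()
mu_n = (n * sigma0**2 * sample_mean + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
var_n = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)

# Predictive (class-conditional) density: N(mu_n, sigma^2 + sigma_n^2).
predictive = norm(loc=mu_n, scale=np.sqrt(sigma**2 + var_n))
print(predictive.pdf(2.1))                      # p(x = 2.1 | D)
```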
Multivariate normal case: unknown µ, known Σ
Generalization of univariate case
p(x | µ) ∼ N(µ, Σ)
p(µ) ∼ N(µ_0, Σ_0)
⇓
p(µ | D) ∼ N(µ_n, Σ_n)
⇓
p(x | D) ∼ N(µ_n, Σ + Σ_n)
Sufficient statistics
Definition
Any function on a set of samples D is a statistic
A statistic s = φ(D) is sufficient for some parameters θ if:
P(D|s, θ) = P(D|s)
If θ is a random variable, a sufficient statistic contains all
relevant information D has for estimating it:
p(θ | D, s) = p(D | θ, s) p(θ | s) / p(D | s) = p(θ | s)
Use
A sufficient statistic allows one to compress a sample D into (possibly few) values
The sample mean and covariance are sufficient statistics for the true mean and covariance of a Gaussian distribution
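A minimal sketch for the univariate Gaussian, showing that estimates computed only from the sufficient statistics (n, ∑ x_j, ∑ x_j²) match those computed from the full sample (the data values are hypothetical):

```python
import numpy as np

x = np.array([4.2, 3.9, 5.1, 4.4, 4.8, 4.0])

# Sufficient statistics for the univariate Gaussian: n, sum x_j, sum x_j^2.
n, s1, s2 = len(x), x.sum(), (x**2).sum()

mu_from_stats = s1 / n
var_from_stats = s2 / n - mu_from_stats**2      # (1/n) sum (x_j - mu)^2, expanded

# The full sample adds nothing for estimating (mu, sigma^2):
assert np.isclose(mu_from_stats, x.mean())
assert np.isclose(var_from_stats, x.var())
```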
Bernoulli distribution
Setting
Boolean event: x = 1 for success, x = 0 for failure (e.g.
tossing a coin)
Parameters: θ = probability of success (e.g. head)
Probability mass function
P(x | θ) = θ^x (1 − θ)^(1−x)
Beta conjugate prior:
P(θ | ψ) = P(θ | α_h, α_t) = Γ(α) / (Γ(α_h) Γ(α_t)) · θ^(α_h − 1) (1 − θ)^(α_t − 1)
where α = α_h + α_t
Bernoulli distribution
Maximum likelihood estimation: example
Dataset D = {H, H, T , T , T , H, H} of N realizations (e.g.
head/tail coin toss results)
Likelihood function:
p(D | θ) = θ · θ · (1 − θ) · (1 − θ) · (1 − θ) · θ · θ = θ^h (1 − θ)^t
where h and t are the numbers of heads and tails in D
Maximum likelihood parameter:
∂/∂θ ln p(D | θ) = 0  ⇒  ∂/∂θ [ h ln θ + t ln(1 − θ) ] = 0
h (1/θ) − t (1/(1 − θ)) = 0
h (1 − θ) = t θ
θ = h / (h + t)
h, t are the sufficient statistics
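A minimal sketch reproducing this example: counting the sufficient statistics h and t in D and computing θ = h/(h+t) = 4/7:

```python
tosses = ["H", "H", "T", "T", "T", "H", "H"]    # the dataset D from the slide

h = tosses.count("H")                           # sufficient statistics
t = tosses.count("T")

theta_ml = h / (h + t)                          # maximum-likelihood estimate
print(h, t, theta_ml)                           # 4 3 0.571...
```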
Bernoulli distribution
Bayesian estimation: example
The parameter posterior is proportional to:
P(θ | D, ψ) ∝ P(D | θ) P(θ | ψ) ∝ θ^h (1 − θ)^t θ^(α_h − 1) (1 − θ)^(α_t − 1)
i.e. the posterior is a Beta distribution with parameters h + α_h, t + α_t:
P(θ | D, ψ) ∝ θ^(h + α_h − 1) (1 − θ)^(t + α_t − 1)
The prediction for a new event (x = 1, i.e. heads) is:
P(x | D) = ∫ P(x | θ) P(θ | D, ψ) dθ = ∫ θ P(θ | D, ψ) dθ = E_{P(θ|D,ψ)}[θ] = (h + α_h) / (h + t + α_h + α_t)
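A minimal sketch of the corresponding Bayesian estimate, with assumed prior pseudo-counts α_h = α_t = 2 chosen only for illustration:

```python
tosses = ["H", "H", "T", "T", "T", "H", "H"]
h, t = tosses.count("H"), tosses.count("T")

alpha_h, alpha_t = 2, 2          # assumed prior pseudo-counts (Beta prior)

# The posterior is Beta(h + alpha_h, t + alpha_t); the prediction for a new
# head is its mean.
p_head = (h + alpha_h) / (h + t + alpha_h + alpha_t)
print(p_head)                    # 6/11 = 0.545..., vs the ML estimate 4/7 = 0.571...

# With alpha_h = alpha_t -> 0 the prediction tends to the ML estimate h/(h+t).
```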
Bernoulli distribution
Interpreting priors
Our prior knowledge is encoded as a number α = α_h + α_t of imaginary experiments
we assume we observed heads α_h times (and tails α_t times)
α is called equivalent sample size
α → 0 reduces estimation to the classical ML approach
(frequentist)
Multinomial distribution
Setting
Categorical event with r states x ∈ {x^1, ..., x^r} (e.g. tossing a six-faced die)
One-hot encoding z(x) = [z_1(x), ..., z_r(x)] with z_k(x) = 1 if x = x^k, 0 otherwise
Parameters: θ = [θ_1, ..., θ_r], the probability of each state
Probability mass function
P(x | θ) = ∏_{k=1}^r θ_k^(z_k(x))
Dirichlet conjugate prior:
P(θ | ψ) = P(θ | α_1, ..., α_r) = Γ(α) / (∏_{k=1}^r Γ(α_k)) · ∏_{k=1}^r θ_k^(α_k − 1)
where α = ∑_{k=1}^r α_k
Multinomial distribution
Maximum likelihood estimation: example
Dataset D of N realizations (e.g. results of tossing a die)
Likelihood function:
p(D | θ) = ∏_{j=1}^N ∏_{k=1}^r θ_k^(z_k(x_j)) = ∏_{k=1}^r θ_k^(N_k)
where N_k is the number of realizations in state x^k
Maximum likelihood parameter:
θ_k = N_k / N
N1 , . . . , Nr are the sufficient statistics
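A minimal sketch (with hypothetical die rolls) that computes the sufficient statistics N_1, ..., N_r and the ML estimates θ_k = N_k/N:

```python
import numpy as np

rolls = np.array([2, 5, 1, 2, 6, 3, 2, 4, 6, 1])   # hypothetical die rolls, states 1..6

counts = np.bincount(rolls, minlength=7)[1:]       # N_1, ..., N_6 (sufficient statistics)
theta_ml = counts / counts.sum()                   # theta_k = N_k / N

print(counts, theta_ml)
```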
Multinomial distribution
Bayesian estimation: example
The parameter posterior is proportional to:
P(θ | D, ψ) ∝ P(D | θ) P(θ | ψ) ∝ ∏_{k=1}^r θ_k^(N_k) θ_k^(α_k − 1)
i.e. the posterior is a Dirichlet distribution with parameters N_k + α_k, k = 1, ..., r:
P(θ | D, ψ) ∝ ∏_{k=1}^r θ_k^(N_k + α_k − 1)
The prediction for a new event is:
P(x^k | D) = ∫ θ_k P(θ | D, ψ) dθ = E_{P(θ|D,ψ)}[θ_k] = (N_k + α_k) / (N + α)
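A minimal sketch of the corresponding Bayesian (Dirichlet-smoothed) prediction, reusing the counts from the previous sketch and an assumed uniform prior α_k = 1:

```python
import numpy as np

rolls = np.array([2, 5, 1, 2, 6, 3, 2, 4, 6, 1])     # same hypothetical rolls as above
counts = np.bincount(rolls, minlength=7)[1:]         # N_1, ..., N_6

alpha = np.ones(6)                                   # assumed Dirichlet prior (one pseudo-count per face)

# The posterior is Dirichlet(N_k + alpha_k); the predictive probability of face k
# is its mean (N_k + alpha_k) / (N + alpha), with alpha = sum_k alpha_k.
p_pred = (counts + alpha) / (counts.sum() + alpha.sum())
print(p_pred)                                        # no face gets probability 0, unlike the ML estimate
```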