Figure 1: A simple thumbtack tossing experiment.

Figure 2: The likelihood function L(θ : D) for the sequence of tosses H, T, T, H, H.
1 Maximum Likelihood Estimation
In this section, we describe the basic principles behind maximum likelihood estimation.
1.1 The Thumbtack Example
We start with what may be considered the simplest learning problem: parameter learning for a single variable. This is a classical “Statistics 101” problem which illustrates
some of the issues that we shall encounter in more complex learning problems. Surprisingly, this simple problem already contains some interesting issues that we need to
tackle.
Imagine that we have a thumbtack, and we conduct an experiment whereby we flip
the thumbtack in the air, and it comes to land as either heads or tails, as in Figure 1. We
toss the thumbtack several times, obtaining a data set consisting of heads or tails outcomes.
Based on this data set, we want to estimate the probability with which the next flip will
land heads or tails.
In this description, we have already made the implicit assumption that the thumbtack
tosses are controlled by an (unknown) parameter θ, which describes the frequency of
heads in thumbtack tosses. In addition, we need to assume that the tosses are independent of each other. That is, the outcome of a toss is not affected by the outcomes of
previous tosses: the thumbtack does not “remember” the previous flips. Data instances
satisfying these assumptions are often referred to as independent and identically distributed (i.i.d.) samples.
Assume that we toss the thumbtack 100 times, of which 35 come up heads. What is
our estimate for θ? Our intuition suggests that the best estimate is 0.35. Had θ been
0.1, for example, our chances of seeing 35/100 heads would be much lower. In fact,
we examined a similar situation in our discussion of sampling methods in Section ??,
where we used samples from a distribution to estimate the probability of a query. As we
discussed, the central limit theorem shows that, as the number of coin tosses grows, it is
increasingly unlikely to sample a sequence of i.i.d. thumbtack flips where the fraction
of tosses that come out heads is very far from θ. Thus, for sufficiently large M , the
fraction of heads among the tosses is a good estimate with high probability.
To formalize this intuition, assume that we have a set of thumbtack tosses x[1], . . . , x[M ]
that are i.i.d., that is, each is sampled independently from the same distribution in which
X[m] is equal to H (heads) or T (tails) with probability θ and 1 − θ, respectively. Our
task is to find a good value for the parameter θ. As in many formulations of learning
tasks, we define a hypothesis space Θ — a set of possibilities that we are considering,
and a scoring function that tells us how good different hypotheses in the space are relative to our data set D. In this case, our hypothesis space Θ is the set of all parameters
θ ∈ [0, 1].
How do we score different possible parameters θ? One way of evaluating θ is by
how well it predicts the data. In other words, if the data is likely given the parameter,
the parameter is a good predictor. For example, suppose we observe the sequence of
outcomes H, T, T, H, H. If we know θ, we could assign a probability to observing
this particular sequence. The probability of the first toss is P (X[1] = H) = θ. The
probability of the second toss is P (X[2] = T | X[1] = H), but our assumption that
the coin tosses are independent allows us to conclude that this probability is simply
P (X[2] = T ) = 1 − θ. This is also the probability of the third outcome, and so on.
Thus, the probability of the sequence is
$$P(H, T, T, H, H : \theta) = \theta(1-\theta)(1-\theta)\theta\theta = \theta^3 (1-\theta)^2.$$
As expected, this probability depends on the particular value θ. As we consider different values of θ, we get different probabilities for the sequence. Thus, we can examine
how the probability of the data changes as a function of θ. We define the likelihood
function to be
$$L(\theta : H, T, T, H, H) = P(H, T, T, H, H : \theta) = \theta^3 (1-\theta)^2.$$
Figure 2 plots the likelihood function in our example.
Clearly, parameter values with higher likelihood are more likely to generate the
observed sequences. Thus, we can use the likelihood function as our measure of quality for different parameter values, and select the parameter value that maximizes the
likelihood. Such an estimator is called Maximum Likelihood Estimator (MLE). By
viewing Figure 2 we see that θ̂ = 0.6 = 3/5 maximizes the likelihood for the sequence
H, T, T, H, H.
Can we find the MLE for the general case? Assume that our data set D of observations contains Mh heads and Mt tails. We want to find the value θ̂ that maximizes the likelihood of θ relative to D. The likelihood function in this case is:
$$L(\theta : D) = \theta^{M_h} (1-\theta)^{M_t}.$$
It turns out that it is easier to maximize the logarithm of the likelihood function. In our
case, the log-likelihood function is:
$$\ell(\theta : D) = M_h \log \theta + M_t \log(1 - \theta).$$
Note that the log-likelihood is monotonically related to the likelihood. Therefore, maximizing the one is equivalent to maximizing the other. However, the log-likelihood is
more convenient to work with, as products are converted to summations.
Differentiating the log-likelihood, setting the derivative to 0, and solving for θ, we get that the maximum likelihood parameter, which we denote θ̂, is
$$\hat{\theta} = \frac{M_h}{M_h + M_t} \qquad (1)$$
as expected.
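As a quick check on Eq. (1), a few lines of Python compare the closed-form estimate with a direct grid maximization of the log-likelihood. This is only an illustrative sketch; the counts are the 35-heads-out-of-100 example from above.

```python
import numpy as np

M_h, M_t = 35, 65                       # 35 heads out of 100 tosses

# Closed-form MLE from Eq. (1)
theta_mle = M_h / (M_h + M_t)

# Numerical check: evaluate the log-likelihood on a grid and take the argmax
thetas = np.linspace(1e-6, 1 - 1e-6, 10_000)
log_lik = M_h * np.log(thetas) + M_t * np.log(1 - thetas)
theta_grid = thetas[np.argmax(log_lik)]

print(theta_mle)    # 0.35
print(theta_grid)   # ~0.35, up to the grid resolution
```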
As we shall see, the maximum likelihood approach has many advantages. However,
the approach also has some limitations. For example, if we get 3 heads out of 10 tosses,
the MLE estimate is 0.3. We get the same estimate if we get 300 heads out of 1000
tosses. Clearly, the two experiments are not equivalent. Our intuition is that, in the
second experiment, we should be more confident of our estimate. Indeed, statistical
estimation theory deals with confidence intervals. These are common in news reports,
e.g., when describing the results of election polls, where we often hear that “61%±2%”
plan to vote for a certain candidate. The 2% is a confidence interval — the poll is
designed so as to select enough people so that the MLE estimate will be within 0.02 of
the true parameter, with high probability. Exercise ?? expands on this topic.
1.2 The Maximum Likelihood Principle
We start by describing the setting of the learning problem. Assume that we observe
several i.i.d. samples of a set of random variables X from an unknown distribution
P ∗ (X ). We assume we know in advance the sample space we are dealing with (i.e.,
which random variables, and what values they can take). However, we do not make
any additional assumptions about P ∗ . We denote the training set of samples as D and
assume it consists of M instances of X : ξ[1], . . . ξ[M ].
Next, we need to consider what exactly we want to learn. We assume that we
are given a parametric model for which we wish to estimate parameters. Formally, a
model is a function P (ξ : θ) that, given a set of parameter values θ and an instance ξ
of X , assigns a probability (or density) to ξ. Of course, we require that for each choice
of parameters θ, P(ξ : θ) is a legal distribution; that is, it is non-negative and
$$\sum_{\xi} P(\xi : \theta) = 1.$$
In general, for each model, not all parameter values are legal. Thus, we need to define
the parameter space Θ, which is the set of allowable parameters.
To get some intuition, we consider concrete examples. The model we examined in
Section 1.1 has parameter space Θthumbtack = [0, 1], and is defined as
$$P_{\text{thumbtack}}(x : \theta) = \begin{cases} \theta & \text{if } x = H \\ 1 - \theta & \text{if } x = T \end{cases}$$
There are many additional examples.
Example 1.1: Suppose that X is a multinomial variable that can take values x1, . . . , xK. The simplest representation of a multinomial distribution is as a vector θ ∈ ℝ^K, such that
$$P_{\text{multinomial}}(x : \theta) = \theta_k \quad \text{if } x = x^k.$$
The parameter space of this model is
$$\Theta_{\text{multinomial}} = \Big\{ \theta \in [0,1]^K : \sum_i \theta_i = 1 \Big\}.$$
Example 1.2: Suppose that X is a continuous variable that can take values in the real line. A Gaussian model for X is
$$P_{\text{Gaussian}}(x : \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}},$$
where θ = ⟨μ, σ⟩. The parameter space for this model is Θ_Gaussian = ℝ × ℝ⁺. That is, we allow any real value of μ and any positive real value for σ.
The next step in maximum likelihood estimation is defining the likelihood function. As we saw in our example, the likelihood function, for a given choice of parameters θ, is the probability (or density) the model assigns the training data:
$$L(\theta : D) = \prod_m P(\xi[m] : \theta).$$
In the thumbtack example, we have seen that we can write the likelihood function
using simpler terms. That is, using the counts Mh and Mt , we managed to have a
compact description of the likelihood. More precisely, once we knew the values of
Mh and Mt , we did not need to consider other aspects of training data (e.g., the order
of tosses). These are the sufficient statistics for the thumbtack learning problem. In a
more general setting, a sufficient statistic is a function of the data that summarizes the
relevant information for computing the likelihood.
Definition 1.3: A function s(ξ) from instances of X to ℝ^ℓ (for some ℓ) is a sufficient statistic if, for any two data sets D and D′ and any θ ∈ Θ, we have that
$$\sum_{\xi[m] \in D} s(\xi[m]) = \sum_{\xi'[m] \in D'} s(\xi'[m]) \;\Longrightarrow\; L(\theta : D) = L(\theta : D').$$
We often refer to the tuple $\sum_{\xi[m] \in D} s(\xi[m])$ as the sufficient statistics of the data set D.
Example 1.4: Let us reconsider the multinomial model of Example 1.1. It is easy to
see that a sufficient statistic for the data set is the tuple of counts ⟨M1, . . . , MK⟩, such that Mk is the number of times the value xk appears in the training data. To obtain these counts by summing instance-level statistics, we define s(x) to be a tuple of dimension K, such that s(x) has a 0 in every position, except at the position k for which x = xk, where its value is 1:
$$s(x^k) = (\underbrace{0, \ldots, 0}_{k-1}, 1, \underbrace{0, \ldots, 0}_{K-k}).$$
Given the vector of counts we can write the likelihood function as
$$L(D : \theta) = \prod_k \theta_k^{M_k}.$$
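To make the role of the sufficient statistics concrete, here is a minimal sketch (the three-valued variable and toy data set are invented for illustration) that accumulates the one-hot vectors s(x) over a data set and evaluates the log-likelihood from the resulting counts alone:

```python
import numpy as np

values = ["x1", "x2", "x3"]                      # the K possible values of X
data = ["x1", "x3", "x1", "x2", "x1", "x3"]      # a toy training set

def one_hot(x):
    """s(x): a K-dimensional vector with a single 1 at the position of x."""
    s = np.zeros(len(values))
    s[values.index(x)] = 1.0
    return s

# The sufficient statistics are the summed one-hot vectors, i.e. the counts M_k
counts = sum(one_hot(x) for x in data)

# log L(D : theta) = sum_k M_k log theta_k
theta = np.array([0.5, 0.2, 0.3])
log_lik = np.sum(counts * np.log(theta))
print(counts)       # [3. 1. 2.]
print(log_lik)
```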
Example 1.5: Let us reconsider the Gaussian model of Example 1.2. In this case, it is less obvious how to construct sufficient statistics. However, if we expand the term (x − μ)² in the exponent, we can rewrite the model as
$$P_{\text{Gaussian}}(x : \mu, \sigma) = \exp\left\{ -x^2 \frac{1}{2\sigma^2} + x \frac{\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2} - \frac{1}{2}\log(2\pi) - \log(\sigma) \right\}.$$
We then see that the function
$$s_{\text{Gaussian}}(x) = \langle 1, x, x^2 \rangle$$
is a sufficient statistic for this model. Note that the first element in the sufficient statistics tuple is "1", which does not depend on the value of the data item; it serves, as in the multinomial case, to count the number of data items.
Several comments about the likelihood function are in order. First, we stress that the likelihood function measures the effect of the choice of parameters on the training data. Thus, for example, if we have two sets of parameters θ and θ′ such that L(θ : D) = L(θ′ : D), then we cannot, given only the data, distinguish between the two choices of parameters. Moreover, if L(θ : D) = L(θ′ : D) for all possible choices of D, then the two parameters are indistinguishable for any outcome. In such a situation, we can say in advance (i.e., before seeing the data) that some distinctions cannot be resolved based on the data alone.
Second, since we are maximizing the likelihood function, we usually want it to be a continuous (and preferably smooth) function of θ. To ensure these properties, most of the theory of statistical estimation requires that P(ξ : θ) is a continuous and differentiable function of θ, and moreover that Θ is a continuous set of points (which is often assumed to be convex).
Once we have defined the likelihood function, we can use maximum likelihood
estimation (MLE) to choose the parameter values. Formally, we state this principle as
follows.
Maximum Likelihood Estimation: Given a data set D, choose parameters θ̂ that satisfy
$$L(\hat{\theta} : D) = \max_{\theta \in \Theta} L(\theta : D).$$
Example 1.6: Consider estimating the parameters of a multinomial distribution as in Example 1.4. As one might guess, the maximum likelihood is attained when
$$\hat{\theta}_k = \frac{M_k}{M}$$
(see Exercise ??). That is, the probability of each value of X corresponds to its frequency in the training data.
Example 1.7: Consider estimating the parameters of a Gaussian distribution as in Example 1.5. It turns out that the maximum is attained when μ and σ correspond to the empirical mean and variance of the training data:
$$\hat{\mu} = \frac{1}{M} \sum_m x[m], \qquad \hat{\sigma}^2 = \frac{1}{M} \sum_m (x[m] - \hat{\mu})^2$$
(see Exercise ??).
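Combining Examples 1.5 and 1.7, the Gaussian MLE can be computed from the accumulated sufficient statistics ⟨1, x, x²⟩ alone. A minimal sketch with synthetic data (the true mean 2.0 and standard deviation 1.5 are arbitrary illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1_000)   # synthetic training data

# Accumulate the sufficient statistics <1, x, x^2> over the data set
M  = np.sum(np.ones_like(x))     # number of instances
S1 = np.sum(x)                   # sum of x
S2 = np.sum(x ** 2)              # sum of x^2

# MLE: empirical mean and variance, expressed through the sufficient statistics
mu_hat     = S1 / M
sigma2_hat = S2 / M - mu_hat ** 2        # equals (1/M) * sum_m (x[m] - mu_hat)^2

print(mu_hat, np.sqrt(sigma2_hat))       # close to 2.0 and 1.5
```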
2 Bayesian Estimation
Although the MLE approach seems plausible, it can be overly simplistic in many cases.
Assume again that we perform the thumbtack experiment and get 3 heads out of 10. It
may be quite reasonable to conclude that the parameter θ is 0.3. But what if we do the
same experiment with a dime, and also get 3 heads? We would be much less likely to
jump to the conclusion that the parameter of the dime is 0.3. Why? Because we have a
lot more experience with tossing dimes, so we have a lot more prior knowledge about
their behavior. Note that we do not want our prior knowledge to be an absolute guide,
but rather a reasonable starting assumption that allows us to counterbalance our current
set of 10 tosses, under the assumption that they may not be typical. However, if we
observe 1,000,000 tosses of the dime, of which 300,000 came out heads, then we may
be more willing to conclude that this is a trick dime, one whose parameter is closer to
0.3.
Maximum likelihood allows us to make neither of these distinctions: between a
thumbtack and a dime, and between 10 tosses and 1,000,000 tosses of the dime. There
is, however, another approach, the one recommended by Bayesian statistics.
2.0.1 Joint Probabilistic Model
In this approach, we encode our prior knowledge about θ with a probability distribution; this distribution represents how likely we are a priori to believe the different
choices of parameters. Once we quantify our knowledge (or lack thereof) about possible values of θ, we can create a joint distribution over the parameter θ and the data cases
that we are about to observe X[1], . . . , X[M ]. This joint distribution is not arbitrary; it
captures our assumptions about the experiment.
Figure 3: The Bayesian network for simple Bayesian parameter estimation: a single parameter node θ with children X[1], X[2], . . . , X[M].
Let us reconsider these assumptions. Recall that we assumed that tosses are independent of each other. Note, however, that this assumption was made when θ was
fixed. If we do not know θ, then the tosses are not marginally independent: Each toss
tells us something about the parameter θ, and thereby about the probability of the next
toss. However, once θ is known, we cannot learn about the outcome of one toss from
observing the results of others. Thus, we assume that the tosses are conditionally independent given θ. We can describe these assumptions using the Bayesian network of
Figure 3.
Having determined the model structure, it remains to specify the local probability
models in this network. We begin by considering the probability P(X[m] | θ). Clearly,
$$P(x[m] \mid \theta) = \begin{cases} \theta & \text{if } x[m] = x^1 \\ 1 - \theta & \text{if } x[m] = x^0 \end{cases}$$
Note that since we now treat θ as a random variable, we use the conditioning bar and write P(x[m] | θ) rather than P(x[m] : θ).
To finish the description of the joint distribution, we need to describe P (θ). This
is our prior probability distribution over the value of θ. In our case, this is a continuous density over the interval [0, 1]. Before we discuss particular choices for this
distribution, let us consider how we use it.
The network structure implies that the joint distribution of a particular data set and
θ factorizes as
$$\begin{aligned} P(x[1], \ldots, x[M], \theta) &= P(x[1], \ldots, x[M] \mid \theta)\, P(\theta) \\ &= P(\theta) \prod_{m=1}^{M} P(x[m] \mid \theta) \\ &= P(\theta)\, \theta^{M_h} (1 - \theta)^{M_t}, \end{aligned}$$
where Mh is the number of heads in the data, and Mt is the number of tails. Note that
the expression P (x[1], . . . , x[M ] | θ) is simply the likelihood function L(θ : D).
This network specifies a joint probability model over parameters and data. There
are several ways in which we can use this network. Most obviously, we can take an observed data set D of M outcomes, and use it to instantiate the values of x[1], . . . , x[M ];
we can then compute the posterior probability over θ:
$$P(\theta \mid x[1], \ldots, x[M]) = \frac{P(x[1], \ldots, x[M] \mid \theta)\, P(\theta)}{P(x[1], \ldots, x[M])}.$$
In this posterior, the first term in the numerator is the likelihood, the second is
the prior over parameters, and the denominator is a normalizing factor that we will
not expand on right now. We see that the posterior is (proportional to) a product of
the likelihood and the prior. This product is normalized so that it will be a proper
density function. In fact, if the prior is a uniform distribution (that is, P (θ) = 1 for all
θ ∈ [0, 1]), then the posterior is just the normalized likelihood function.
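Since the uniform-prior posterior is just the normalized likelihood, it can be visualized with a simple grid computation. The following is only an illustrative sketch; the grid resolution is arbitrary, and the counts are those of the H, T, T, H, H sequence.

```python
import numpy as np

M_h, M_t = 3, 2                         # counts for the sequence H, T, T, H, H

# Discretize theta and evaluate likelihood * prior on the grid
thetas = np.linspace(0.0, 1.0, 1001)
likelihood = thetas ** M_h * (1.0 - thetas) ** M_t
prior = np.ones_like(thetas)            # uniform prior: P(theta) = 1 on [0, 1]

# Normalize so the grid values integrate (approximately) to 1
dtheta = thetas[1] - thetas[0]
unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dtheta)

print(posterior.sum() * dtheta)         # ~1.0
print(thetas[np.argmax(posterior)])     # 0.6: with a uniform prior the mode equals the MLE
```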
2.0.2 Prediction
If we do use a uniform prior, what then is the difference between the Bayesian approach
and the MLE approach of the previous section? The main philosophical difference is
in the use of the posterior. Instead of selecting from the posterior a single value for the
parameter θ, we use it, in its entirety, for predicting the probability over the next toss.
To derive this prediction in a principled fashion, we introduce the value of the next
coin toss x[M + 1] to our network. We can then compute the probability over x[M + 1]
given the observations of the first M tosses. Note that, in this model, the parameter θ
is unknown, and we are considering all of its possible values. By reasoning over the
possible values of θ and using the chain rule we see that
$$\begin{aligned} P(x[M+1] \mid x[1], \ldots, x[M]) &= \int P(x[M+1] \mid \theta, x[1], \ldots, x[M])\, P(\theta \mid x[1], \ldots, x[M])\, d\theta \\ &= \int P(x[M+1] \mid \theta)\, P(\theta \mid x[1], \ldots, x[M])\, d\theta, \end{aligned}$$
where we use the conditional independence in the network to rewrite P (x[M + 1] |
θ, x[1], . . . , x[M ]) as P (x[M + 1] | θ). In other words, we are integrating our posterior
over θ to predict the probability of heads for the next toss.
Let us go back to our thumbtack example. Assume that our prior is uniform over
θ in the interval [0, 1]. Then P (θ | x[1], . . . , x[M ]) is proportional to the likelihood
P (x[1], . . . , x[M ] | θ) = θMh (1 − θ)Mt . Plugging this into the integral, we need to
compute
$$P(X[M+1] = x^1 \mid x[1], \ldots, x[M]) = \frac{1}{P(x[1], \ldots, x[M])} \int_0^1 \theta \cdot \theta^{M_h} (1 - \theta)^{M_t}\, d\theta.$$
Doing all the math (see Exercise ??), we get (for uniform priors)
$$P(X[M+1] = x^1 \mid x[1], \ldots, x[M]) = \frac{M_h + 1}{M_h + M_t + 2}. \qquad (2)$$
This prediction is quite similar to the MLE prediction of Eq. (1), except that it adds
one “imaginary” sample to each count. Clearly, as the number of samples grows, the
Bayesian estimator and the MLE estimator converge to the same value. The particular estimator that corresponds to a uniform prior is often referred to as Laplace's correction.
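The following sketch contrasts the MLE of Eq. (1) with the uniform-prior (Laplace-corrected) prediction of Eq. (2) on the two scenarios mentioned above, 3 heads out of 10 tosses and 300 out of 1,000:

```python
def mle(M_h, M_t):
    return M_h / (M_h + M_t)               # Eq. (1)

def laplace(M_h, M_t):
    return (M_h + 1) / (M_h + M_t + 2)     # Eq. (2): uniform (Beta(1, 1)) prior

for M_h, M_t in [(3, 7), (300, 700)]:
    print(M_h + M_t, mle(M_h, M_t), laplace(M_h, M_t))
# 10    0.3  0.333...  -> small sample: the prediction is pulled toward 0.5
# 1000  0.3  0.3004    -> large sample: the two estimators essentially agree
```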
Figure 4: Examples of Beta distributions for different choices of hyperparameters: Beta(1, 1), Beta(2, 2), Beta(10, 10), Beta(3, 2), Beta(15, 10), and Beta(0.5, 0.5).
2.0.3 Priors
We now want to consider non-uniform priors. The challenge here is to pick a distribution over this continuous space that we can represent compactly (e.g., using an analytic
formula), and update efficiently as we get new data. For reasons that we discuss below,
an appropriate prior in this case is the Beta distribution.
Definition 2.1: A Beta distribution is parameterized by two hyperparameters αh, αt, which are positive reals. The distribution is defined as follows:
$$\theta \sim \mathrm{Beta}(\alpha_h, \alpha_t) \quad \text{if} \quad p(\theta) = \gamma\, \theta^{\alpha_h - 1} (1 - \theta)^{\alpha_t - 1},$$
where γ is a normalizing constant, defined as
$$\gamma = \frac{\Gamma(\alpha_h + \alpha_t)}{\Gamma(\alpha_h)\, \Gamma(\alpha_t)},$$
and $\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\, dt$ is the Gamma function.
Intuitively, the hyperparameters αh and αt correspond to the number of imaginary
heads and tails that we have “seen” before starting the experiment. Figure 4 shows
Beta distributions for different values of α.
At first glance, the normalizing constant for the Beta distribution might seem somewhat obscure. However, the Gamma function is actually a very natural one: it is simply a continuous generalization of factorials. More precisely, it satisfies the properties Γ(1) = 1 and Γ(x + 1) = xΓ(x). As a consequence, we easily see that Γ(n + 1) = n! when n is an integer. This function arises directly from the integral over θ that defines the normalizing constant: since $\int \theta^{x-1}\, d\theta = \frac{1}{x}\theta^{x}$, the 1/x coefficients accumulate, resulting in the Γ function.
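A quick way to check the "continuous factorial" property numerically, assuming SciPy is available (this snippet is illustrative and not part of the text):

```python
from math import factorial
from scipy.special import gamma

for n in range(1, 6):
    # Gamma(n + 1) == n! for integer n
    print(n, gamma(n + 1), factorial(n))
```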
Beta distributions have properties that make them particularly useful for parameter
estimation. Assume our distribution P (θ) is Beta(αh , αt ), and consider a single coin
toss X. Let us compute the marginal probability over X, based on P (θ). To compute
the marginal probability, we need to integrate out θ; standard integration techniques
can be used to show that:
$$P(X[1] = x^1) = \int_0^1 P(X[1] = x^1 \mid \theta) \cdot P(\theta)\, d\theta = \int_0^1 \theta \cdot P(\theta)\, d\theta = \frac{\alpha_h}{\alpha_h + \alpha_t}.$$
This conclusion supports our intuition that the Beta prior indicates that we have seen αh (imaginary) heads and αt (imaginary) tails.
Now, let us see what happens as we get more observations. Specifically, we observe
Mh heads and Mt tails. It follows easily that:
$$\begin{aligned} P(\theta \mid x[1], \ldots, x[M]) &\propto P(x[1], \ldots, x[M] \mid \theta)\, P(\theta) \\ &\propto \theta^{M_h} (1 - \theta)^{M_t} \cdot \theta^{\alpha_h - 1} (1 - \theta)^{\alpha_t - 1} \\ &= \theta^{\alpha_h + M_h - 1} (1 - \theta)^{\alpha_t + M_t - 1}, \end{aligned}$$
which is precisely Beta(αh + Mh , αt + Mt ). This result illustrates a key property of
the Beta distribution: If the prior is a Beta distribution, then the posterior distribution,
i.e., the prior conditioned on the evidence, is also a Beta distribution.
An immediate consequence is that we can compute the probabilities over the next toss:
$$P(X[M+1] = x^1 \mid x[1], \ldots, x[M]) = \frac{\alpha_h + M_h}{\alpha + M},$$
where α = αh + αt. In this case, our posterior Beta distribution tells us that we have seen αh + Mh heads (imaginary and real) and αt + Mt tails.
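A small sketch of the Beta-binomial update: start from Beta(αh, αt), add the observed counts, and read off the posterior predictive (αh + Mh)/(α + M). The numbers anticipate the Beta(1, 1) versus Beta(10, 10) comparison in the next paragraph (3 heads in 10 tosses):

```python
def beta_posterior(alpha_h, alpha_t, M_h, M_t):
    """Posterior hyperparameters after observing M_h heads and M_t tails."""
    return alpha_h + M_h, alpha_t + M_t

def predictive(alpha_h, alpha_t):
    """Probability of heads on the next toss under Beta(alpha_h, alpha_t)."""
    return alpha_h / (alpha_h + alpha_t)

# 3 heads out of 10 tosses, under two priors of different strength
for prior in [(1, 1), (10, 10)]:
    post = beta_posterior(*prior, M_h=3, M_t=7)
    print(prior, post, predictive(*post))
# (1, 1)   -> (4, 8)   -> 0.333...
# (10, 10) -> (13, 17) -> 0.433...
```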
It is interesting to examine the effect of the prior on the probability over the next
coin toss. For example, the prior Beta(1, 1) is very different than Beta(10, 10): Although both predict that the probability of heads in the first toss is 0.5, the second
prior is more entrenched, and requires more observations to deviate from the prediction 0.5. To see this, suppose we observe 3 heads in 10 tosses. Using the first prior,
our estimate is (3 + 1)/(10 + 2) = 4/12 ≈ 0.33. On the other hand, using the second prior, our estimate is (3 + 10)/(10 + 20) = 13/30 ≈ 0.43. However, as we obtain more data, the effect of the prior diminishes. If we obtain 1,000 tosses of which 300 are heads, the first prior gives us an estimate of 301/1,002 and the second an estimate of 310/1,020, both of which are
very close to 0.3. Thus, we see that the Bayesian framework allows us to capture both
of the relevant distinctions: The distinction between the thumbtack and the dime can
be captured by the strength of the prior: for a dime, we might use αh = αt = 100,
whereas for a thumbtack, we might use αh = αt = 1. The distinction between 10 and
1000 samples is captured by the peakedness of our posterior, which increases with the
amount of data.
2.1 Priors and Posteriors
We now turn to examine in more detail the Bayesian approach to dealing with unknown
parameters. As before, we assume a general learning problem where we observe a
training set D that contains M i.i.d. samples of a set of random variables X from an
unknown distribution P ∗ (X ). We also assume that we have a parametric model P (ξ |
θ) where we can choose parameters from a parameter space Θ.
Recall that the MLE approach attempts to find parameters θ̂ that are the parameters
in Θ that are “best” given the data. The Bayesian approach, on the other hand, does
not attempt to find such a point estimate. Instead, the underlying principle is that we
should keep track of our beliefs about θ’s values, and use these beliefs for reaching conclusions. That is, we should quantify the subjective probability we assign to different
values of θ after we have seen the evidence. Note that, in representing such subjective probabilities, we now treat θ as a random variable. Thus, the Bayesian approach
requires that we use probabilities to describe our initial uncertainty about the parameters θ, and then use probabilistic reasoning (i.e., Bayes rule) to take into account our
observations.
To perform this task, we need to describe a joint distribution P (D, θ) over the data
and the parameters. We can easily write
P (D, θ) = P (D | θ)P (θ)
The first term is just the likelihood function we discussed above. The second term
is the prior distribution over the possible values in Θ. The prior captures our initial
uncertainty about the parameters. It can also capture our previous experience before
starting the experiment. For example, if we study coin tossing, we might have prior
experience that suggests that most coins are unbiased (or nearly unbiased).
Once we have specified the likelihood function and the prior, we can use the data
to derive the posterior distribution over the parameters. Since we have specified a joint
distribution over all the quantities in question, the posterior is immediately derived by
Bayes rule:
$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}.$$
The term P(D) is the marginal likelihood of the data,
$$P(D) = \int_{\Theta} P(D \mid \theta)\, P(\theta)\, d\theta.$$
That is, the integration of the likelihood over all possible parameter assignments. This
is the a priori probability of seeing this particular dataset given our prior beliefs.
As we saw, for some probabilistic models, the likelihood function can be compactly
described by using sufficient statistics. Can we also compactly describe the posterior
distribution? In general, this depends on the form of the prior. As we saw in the
thumbtack example of Section 1.1, we can sometimes find priors for which we have a
description of the posterior.
As another example of the forms of priors and posteriors, let us examine the learning problem of Example 1.4. Here we need to describe our uncertainty about the parameters of a multinomial distribution. The parameter space Θ is the space of all non-negative vectors θ = ⟨θ1, . . . , θK⟩ such that $\sum_k \theta_k = 1$. As we saw in Example 1.4, the likelihood function in this model has the form:
$$L(D : \theta) = \prod_k \theta_k^{M_k}.$$
Since the posterior is a product of the prior and the likelihood, it seems natural to
require that the prior also have a form similar to the likelihood.
One such family of priors is the family of Dirichlet priors, which generalize the Beta priors we discussed above. A Dirichlet prior is specified by a set of hyperparameters α1, . . . , αK, so that
$$\theta \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_K) \quad \text{if} \quad P(\theta) \propto \prod_k \theta_k^{\alpha_k - 1}.$$
It is easy to see that if we use a Dirichlet prior, then the posterior is also Dirichlet.
Proposition 2.2: If P (θ) is Dirichlet (α1 , . . . , αK ) then P (θ | D) is Dirichlet (α1 +
M1 , . . . , αK + MK ), where Mk is the number of occurrences of xk .
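Proposition 2.2 makes the update trivial to implement: add the observed counts to the hyperparameters. A sketch with made-up numbers (not from the text):

```python
import numpy as np

alpha  = np.array([1.0, 1.0, 1.0])   # Dirichlet(1, 1, 1) prior over a 3-valued X
counts = np.array([5.0, 0.0, 2.0])   # M_k: occurrences of each value x^k in D

alpha_posterior = alpha + counts     # Dirichlet(alpha_1 + M_1, ..., alpha_K + M_K)
print(alpha_posterior)               # [6. 1. 3.]
```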
Priors such as the Dirichlet priors are useful since they ensure that the posterior has a nice compact description. Moreover, this description uses the same representation as the prior. This phenomenon is a general one, and one that we strive to achieve, as it makes our computation and representation much easier.
Definition 2.3: A family of priors P(θ : α) is conjugate to a particular model P(ξ | θ) if for any possible dataset D of i.i.d. samples from P(ξ | θ), and any choice of legal hyperparameters α for the prior over θ, there are hyperparameters α′ that describe the posterior. That is,
$$P(\theta : \alpha') \propto P(D \mid \theta)\, P(\theta : \alpha).$$
For example, Dirichlet priors are conjugate to the multinomial model. We note that this
does not preclude the possibility of other families that are also conjugate to the same
model. See Exercise ?? for an example of such a prior for the multinomial model. We
can find conjugate priors for other models as well. See Exercise ?? and Exercise ?? for
the development of conjugate priors for the Gaussian distribution.
This discussion shows some examples where we can easily update our beliefs about
θ after observing a set of instances D. This update process results in a posterior that
combines our prior knowledge and our observations. What can we do with the posterior? We can use the posterior to determine properties of the model at hand. For
example, to assess our beliefs that a coin we experimented with is biased toward heads,
we might compute the posterior probability that θ > t for some threshold t, say 0.6.
Another use of the posterior is to predict the probability of future examples. Suppose that we are about to sample a new instance ξ[M + 1]. Since we already have
observations over previous instances, our probability over a new example is
$$\begin{aligned} P(\xi[M+1] \mid D) &= \int P(\xi[M+1] \mid D, \theta)\, P(\theta \mid D)\, d\theta \\ &= \int P(\xi[M+1] \mid \theta)\, P(\theta \mid D)\, d\theta \\ &= \mathrm{E}_{P(\theta \mid D)}\big[ P(\xi[M+1] \mid \theta) \big], \end{aligned}$$
where, in the second step, we use the fact that instances are independent given θ. Thus,
our prediction is the average over all parameters according to the posterior.
Let us examine prediction with the Dirichlet prior. We need to compute
$$P(x[M+1] = x^k \mid D) = \mathrm{E}_{P(\theta \mid D)}[\theta_k].$$
To compute the prediction on a new data case, we need to compute the expectation of particular parameters with respect to a Dirichlet distribution over θ.
Proposition 2.4: Let P(θ) be a Dirichlet distribution with hyperparameters α1, . . . , αK; then
$$\mathrm{E}[\theta_k] = \frac{\alpha_k}{\sum_k \alpha_k}.$$
Recall that our posterior is Dirichlet (α1 +M1 , . . . , αK +MK ) where M1 , . . . , Mk
are the sufficient statistics from the data. Hence, the prediction with Dirichlet priors is
$$P(x[M+1] = x^k \mid D) = \frac{M_k + \alpha_k}{M + \sum_k \alpha_k}.$$
This prediction is similar to prediction with the MLE parameters. The only difference is that we added the hyperparameters to our counts when making the prediction.
For this reason the Dirichlet hyperparameters are often called pseudo-counts. We can
think of these as the number of times we have seen the different outcomes in our prior
experience before conducting our current experiment.
The total of the pseudo-counts reflects how confident we are in our prior. To see this, we define M′ = α1 + · · · + αK to be the sample size of the pseudo-counts. The parameter M′ is often called the equivalent sample size. Using M′, we can rewrite the hyperparameters as αk = M′θ′k, where θ′ = {θ′k : k = 1, . . . , K} is a distribution describing the mean prediction of our prior. We can see that the prior prediction (before observing any data) is simply θ′. Moreover, we can rewrite the prediction given the posterior as:
$$P(x[M+1] = x^k \mid D) = \frac{M'}{M + M'} \cdot \theta'_k + \frac{M}{M + M'} \cdot \frac{M_k}{M}. \qquad (3)$$
That is, the prediction is a weighted average (convex combination) of the prior mean and the MLE estimate. The combination weights are determined by the relative magnitude of M′, the confidence of the prior (or total weight of the pseudo-counts), and M, the number of observed samples.
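The convex-combination reading of Eq. (3) can be verified directly: the prediction computed from the posterior hyperparameters coincides with the weighted average of the prior mean and the MLE. A sketch with arbitrary illustrative numbers:

```python
import numpy as np

alpha  = np.array([2.0, 2.0, 4.0])     # prior hyperparameters alpha_k
counts = np.array([30.0, 10.0, 10.0])  # observed counts M_k

M_prime    = alpha.sum()               # equivalent sample size M'
M          = counts.sum()              # number of real samples
prior_mean = alpha / M_prime           # theta'_k
mle        = counts / M                # M_k / M

# Eq. (3): weighted average of the prior mean and the MLE estimate
pred_mix = (M_prime / (M + M_prime)) * prior_mean + (M / (M + M_prime)) * mle

# Direct posterior prediction: (M_k + alpha_k) / (M + M')
pred_direct = (counts + alpha) / (M + M_prime)

print(np.allclose(pred_mix, pred_direct))   # True
```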
To gain some intuition for the interaction between these different factors, Figure 5
shows the effect of the strength and means of the prior on our estimates. We can
Figure 5: The effect of the strength and means of the prior on our estimates. Our empirical distribution is an idealized version of samples from a biased coin where the frequency of heads is 0.2. The x axis represents the number of samples M from the distribution, and the y axis the expected probability of heads according to the Bayesian estimate. (a) shows the effect of varying the prior means θ′h, θ′t for a fixed prior strength M′. (b) shows the effect of varying the prior strength M′ for a fixed prior mean θ′h = θ′t = 0.5.
see that, as the amount of real data grows, our estimate converges to the empirical
distribution, regardless of the starting point. The convergence time grows both with the
difference between the prior mean and the empirical mean, and with the strength of the
prior.
Based on Eq. (3), we can see that the Bayesian prediction will converge to the MLE estimate in two situations:
• When M → ∞. Intuitively, when we have a very large training set, the contribution of the prior is negligible, and the prediction will be dominated by the frequency of outcomes in the data.
• When M′ → 0. In this case, we are unsure about our prior. Note that the case where M′ = 0 is not achievable: the normalization constant for the Dirichlet prior grows to infinity when the hyperparameters are close to 0. Thus, the prior with M′ = 0 (that is, αk = 0 for all k) is not well defined. Nonetheless, we can define its behavior by examining the limit as M′ approaches 0. The prior with M′ = 0 is often called an improper prior.
The difference between the Bayesian estimate and the MLE estimate arises when M is not too large and M′ is not close to 0. In these situations, the Bayesian estimate is “biased” toward the prior probability θ′. In effect, the Bayesian estimate is then smoother than the MLE estimate. Since we have few samples, we are quite unsure about our estimate given the data. Moreover, we can see that an additional sample can change the MLE estimate dramatically.
Figure 6: The effect of priors on smoothing our parameter estimates. The graph shows the estimate of P(X = H | D) (y-axis) after seeing different numbers of samples (x-axis). The graph below the x-axis shows the particular sequence of tosses. The solid line corresponds to the MLE estimate, and the remaining ones to Bayesian estimates with different strengths and uniform prior means. The large-dash line corresponds to Beta(1, 1), the small-dash line to Beta(5, 5), and the dotted line to Beta(10, 10).
Example 2.5: Suppose we are trying to estimate the parameter associated with a coin, and we observe one head and one tail. Our MLE estimate of θH is 1/2 = 0.5. Now, if the next observation is a head, we will change our estimate to 2/3 ≈ 0.66. On the other hand, if our next observation is a tail, we will change our estimate to 1/3 ≈ 0.33. In contrast, consider the Bayesian estimate with a Dirichlet prior with M′ = 1 and θ′H = 0.5. With this estimator our original estimate is 1.5/3 = 0.5. If we observe another head, we revise to 2.5/4 = 0.625, and if we observe another tail, we revise to 1.5/4 = 0.375. We see that the estimate changes by slightly less after the update. If M′ is larger, the smoothing is more aggressive. For example, when M′ = 5, our estimate is 4.5/8 = 0.5625 after observing a head, and 3.5/8 = 0.4375 after observing a tail. We can also see this effect visually in Figure 6, which shows our changing estimate of P(X = H | D) as we observe a particular sequence of tosses.
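The numbers in Example 2.5 can be reproduced with a few lines; a prior with equivalent sample size M′ and mean θ′H = 0.5 corresponds to pseudo-counts αH = αT = M′/2 (a sketch, not part of the text):

```python
def bayes_estimate(M_h, M_t, M_prime, theta_h_prime=0.5):
    """Posterior-mean estimate of P(X = H) with pseudo-counts M' * theta'_H."""
    alpha_h = M_prime * theta_h_prime
    alpha_t = M_prime * (1 - theta_h_prime)
    return (M_h + alpha_h) / (M_h + M_t + M_prime)

# One head and one tail observed so far, then one more head or one more tail:
for M_prime in [1, 5]:
    print(M_prime,
          bayes_estimate(1, 1, M_prime),   # 0.5 in both cases
          bayes_estimate(2, 1, M_prime),   # 0.625 (M'=1), 0.5625 (M'=5)
          bayes_estimate(1, 2, M_prime))   # 0.375 (M'=1), 0.4375 (M'=5)
```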
This smoothing effect results in more robust estimates when we do not have enough
data to reach definite conclusions. If we have good prior knowledge, we revert to it.
Alternatively, if we do not have prior knowledge, we can use a uniform prior that will
keep our estimate from taking extreme values. In general, it is a bad idea to have
extreme estimates (ones where some of the parameters are close to 0) since these might
assign too small a probability to new instances we later observe. In particular, as we already discussed, probability estimates that are actually 0 are dangerous, as no amount of evidence can change them. Thus, if we are unsure about our estimates, it is better to
bias them away from extreme estimates. The MLE estimate, on the other hand, often
assigns probability 0 to values that were not observed in the training data.
Figure 7: Example of the differences between the maximal likelihood score and the marginal likelihood for the sequence of coin tosses H, T, T, H, H. (The plotted quantity is P(D | θ)P(θ) as a function of θ.)
3 Marginal Likelihood
Consider a single binary random variable X, and assume we have a prior distribution
Dirichlet (αh , αt ) over X. Consider a data set D that has Mh heads and Mt tails.
Then, the maximum likelihood value given D is
$$P(D \mid \hat{\theta}) = \left(\frac{M_h}{M}\right)^{M_h} \cdot \left(\frac{M_t}{M}\right)^{M_t}.$$
Now, consider the Bayesian way of assigning probability to the data:
$$P(D \mid G) = \int_{\Theta_G} P(D \mid \theta_G, G)\, P(\theta_G \mid G)\, d\theta_G, \qquad (4)$$
where P(D | θG, G) is the likelihood of the data given the network ⟨G, θG⟩ and P(θG | G) is our prior distribution over different parameter values for the network G. We call this term the marginal likelihood of the data given the structure, since we marginalize out the unknown parameters.
Here, we are not conditioning on the parameter. Instead, we need to compute the
probability P (X[1], . . . , X[M ]) of the data given our prior. One approach to computing this term is to evaluate the integral Eq. (4). An alternative approach uses the chain
rule
P (x[1], . . . , x[M ]) = P (x[1]) · P (x[2] | x[1]) · . . . · P (x[M ] | x[1], . . . , x[M − 1])
Recall that if we use a Dirichlet prior, then
$$P(x[m+1] \mid x[1], \ldots, x[m]) = \frac{M_h^m + \alpha_h}{m + \alpha},$$
where $M_h^m$ is the number of heads in the first m examples. For example, if D = ⟨H, T, T, H, H⟩,
$$\begin{aligned} P(x[1], \ldots, x[5]) &= \frac{\alpha_h}{\alpha} \cdot \frac{\alpha_t}{\alpha + 1} \cdot \frac{\alpha_t + 1}{\alpha + 2} \cdot \frac{\alpha_h + 1}{\alpha + 3} \cdot \frac{\alpha_h + 2}{\alpha + 4} \\ &= \frac{[\alpha_h(\alpha_h + 1)(\alpha_h + 2)]\,[\alpha_t(\alpha_t + 1)]}{\alpha(\alpha + 1) \cdots (\alpha + 4)}. \end{aligned}$$
Picking αh = αt = 1, so that α = αh + αt = 2, we get
$$\frac{[1 \cdot 2 \cdot 3] \cdot [1 \cdot 2]}{2 \cdot 3 \cdot 4 \cdot 5 \cdot 6} = \frac{12}{720} \approx 0.017$$
(see Figure 7), which is significantly lower than the maximum likelihood value
$$\left(\frac{3}{5}\right)^3 \cdot \left(\frac{2}{5}\right)^2 = \frac{108}{3125} \approx 0.035.$$
Thus, the maximum likelihood ascribes a much higher probability to this sequence than does the marginal likelihood. The reason is that the maximum likelihood value is an overly optimistic assessment, based on a parameter that was chosen with full retrospective knowledge to be an optimal fit to the entire sequence.
In general, for a binomial distribution with a Beta prior, we have
$$P(x[1], \ldots, x[M]) = \frac{[\alpha_h(\alpha_h + 1) \cdots (\alpha_h + M_h - 1)]\,[\alpha_t(\alpha_t + 1) \cdots (\alpha_t + M_t - 1)]}{\alpha(\alpha + 1) \cdots (\alpha + M - 1)}.$$
Each of the terms in square brackets is a product of a sequence of numbers such as α(α + 1) · · · (α + M − 1). If α is an integer, we can write this product as (α + M − 1)!/(α − 1)!.
However, we do not necessarily know that α is an integer. It turns out that we can use a generalization of the factorial function for this purpose. Recall that the Gamma function is such that Γ(m) = (m − 1)! and Γ(x + 1) = x · Γ(x). Using the latter property, we can rewrite
$$\alpha(\alpha + 1) \cdots (\alpha + M - 1) = \frac{\Gamma(\alpha + M)}{\Gamma(\alpha)}.$$
Hence,
$$P(x[1], \ldots, x[M]) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + M)} \cdot \frac{\Gamma(\alpha_h + M_h)}{\Gamma(\alpha_h)} \cdot \frac{\Gamma(\alpha_t + M_t)}{\Gamma(\alpha_t)}.$$
A similar formula holds for a multinomial distribution over the space x1, . . . , xK, with a Dirichlet prior with hyperparameters α1, . . . , αK:
$$P(x[1], \ldots, x[M]) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + M)} \cdot \prod_{i=1}^{K} \frac{\Gamma(\alpha_i + M[x^i])}{\Gamma(\alpha_i)}, \qquad (5)$$
where α = α1 + · · · + αK and M[xi] is the number of occurrences of the value xi in the data.
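In practice, Eq. (5) is evaluated in log space with the log-Gamma function to avoid overflow. The sketch below (assuming SciPy's gammaln) also recovers the binomial special case for H, T, T, H, H with a uniform Beta(1, 1) prior, i.e. the value 12/720 ≈ 0.017 computed above:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(counts, alphas):
    """Log of Eq. (5): multinomial data with a Dirichlet(alphas) prior."""
    counts = np.asarray(counts, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    return (gammaln(alphas.sum()) - gammaln(alphas.sum() + counts.sum())
            + np.sum(gammaln(alphas + counts) - gammaln(alphas)))

# Binomial case: M_h = 3, M_t = 2, uniform prior alpha_h = alpha_t = 1
print(np.exp(log_marginal_likelihood([3, 2], [1, 1])))   # ~0.0167 = 12/720
```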
Note that the final expression for the marginal likelihood is invariant to the order
we selected in the expansion via the chain rule. In particular, any other order results
in exactly the same final expression. This property is reassuring, because the i.i.d. assumption tells us that the specific order in which we get data cases is insignificant.
Also note that the marginal likelihood can be computed directly from the same sufficient statistics used in the computation of the likelihood function — the counts of the
different values of the variable in the data.