MACHINE LEARNING INTRODUCTION: STRING CLASSIFICATION
THOMAS MAILUND
Machine learning means different things to different people, and there is no generally agreed-upon core set of algorithms that must be learned. In this class we will therefore not focus so much on specific algorithms or machine learning models, but rather give an introduction to the overall approach to using machine learning in bioinformatics, as we see it.
To us, the core of machine learning boils down to three things: 1) building computer models to capture some desired structure of the data you are working on, 2) training such models on existing data to optimise them as well as we can, and 3) using them to make predictions on new data.
In these lecture notes we start with some toy examples illustrating these steps. Later
you will see a concrete example of this when building a gene finder using a hidden
Markov model. At the end of the class you will see algorithms that do not quite follow
the framework in these notes, just to see that there are other approaches.
1. Classifying strings
To illustrate the three core tasks mentioned above, we use a toy example where we
want to classify strings as coming from one class of strings rather than another. It is a
very simple example, and probably not quite an approach we would actually take in a
real application. It illustrates many of the core ideas you will see when you work on the
hidden Markov model project later in the class, though.
The setup we imagine is this: we somehow get strings that are generated from one
of two processes, and given a string we want to classify it according to which process
it comes from. To do this, we have to 1) build a model that captures strings, 2) train
this model to classify strings, and 3) use the model on new strings. Going through the
example we’ll switch 2) and 3), though; we need to know how to actually classify strings
using the model before we can train the model to do it. Anyway, those are the tasks.
2. Modelling strings from different processes
By “modelling” we mean constructing an algorithm or some mathematics we can
apply to our data. Think of it as constructing some function, f , that maps a data point,
x, to some value y = f (x). In the general case, both x and y can be vectors. A good
model is a function where f extracts the relevant features of the input, x, and gives us
a y we can use to make predictions about x; in this case that just means that y should
be something we can use to classify x.
That’s a bit abstract, but in our string classification problem it simply means that we
want to construct a function that given a string gives us a classification.
2.1. Modelling, probabilities, and likelihoods. In machine learning we are rarely so lucky that we can get perfect models, that is, models that classify correctly with 100% accuracy. So we cannot expect that f will always give us a perfect y; at best we can hope for a "good" f. We need to quantify what "good" means in order to know exactly how good a model we have, to compare two models to know which is better, and to optimise a model to be as good as we can make it.
Probability theory and statistics give us a very strong framework for measuring how good a given model is, and general approaches we can use to train models. It is not the only approach to machine learning, but practically all classical machine learning models and algorithms can be framed in terms of probabilistic models and statistical inference, so as a basic framework it is very powerful.
For a probabilistic model of strings from two different classes, we can look at the joint
probability of seeing a string x ∈ Σ∗ from class Ci : Pr(x, Ci ). In section 3, Classifying
Strings, we will see how to classify strings from this, but for now let us just consider
how to specify such a probability.
In general we will build models with parameters we can tweak to fit them to data, so rather than specifying a single Pr(x, Ci) we have a whole class of probabilities indexed by parameters θ: Pr(x, Ci ; θ), where θ can be continuous or discrete, a single value or an arbitrarily long vector of values, whatever we come up with for our model.
Training our model will boil down to picking a good parameter point, θ̂, where we can then use the function (x, Ci) ↦ Pr(x, Ci ; θ̂) for classifying x. This function we call the probability of (x, Ci) given parameters θ̂ (and implicitly given the assumed model). If we imagine keeping the data point fixed instead, at some point (x̂, Ĉi), we have a function mapping parameters to values: θ ↦ Pr(x̂, Ĉi ; θ). This we call the likelihood of θ given the data (x̂, Ĉi), and we sometimes write this lhd(θ ; x̂, Ĉi) instead of Pr(x̂, Ĉi ; θ).
The only difference between the probability of the data given the parameters and the likelihood of the parameters given the data is which part we keep fixed and which we vary. We do require that Pr(x, Ci ; θ) is a probability distribution over (x, Ci), though, which means that the sum over all possible values of x and Ci (or the integral, if we had continuous variables) must be 1, while we do not require that summing (or integrating) over all possible values of θ gives 1.
2.2. Modelling strings from two classes. How exactly to define a probability like
Pr(x, Ci ; θ) is often subjective and somewhat arbitrary, and there rarely is one “right
way” of doing it. So it is somewhat like programming: there are many ways you can
solve a problem and you can be more or less creative about it. There are some general
strategies that are often useful, but it always depends on the application and there are
no guarantees that these strategies will work. You just have to try and see how it goes.
One strategy that is often successful when we want to classify data is to look at the probability of a given data point conditional on the class, that is, the probability Pr(x | Ci ; θ). We will specify the probability of a string x in each of the two classes, C1 and C2, and then use the differences between these probabilities to decide which class x most likely comes from.
Since we are unlikely to guess the “true” model in any real application of machine
learning, constructing models all boils down to constructing something that is fast to
compute and good enough for our purpose. As with any programming task, it makes
sense to start simple. For a string x = x1 x2 · · · xk we need to define Pr(x1, x2, . . . , xk | Ci ; θ). A simple model assumes first that the letters in x are independent and second that the probability of seeing a given letter a ∈ Σ is independent of the index in x. With these assumptions, the probability of the string, which is the joint probability of the letters in the string, becomes Pr(x1 | Ci ; θ) Pr(x2 | Ci ; θ) · · · Pr(xk | Ci ; θ). The parameters for such a model could simply specify the probability Pr(a | Ci ; θ) of seeing each letter a ∈ Σ. Let θ = (p(1), p(2)), where p(i) is a vector indexed by the letters a in our alphabet with Σ_{a∈Σ} p(i)_a = 1. We use p(i) as the distribution of letters in class Ci and consider it a parameter we can fit to the data when we later train the model.
To compute the probability of any given string, assuming it came from class Ci, you simply look up the probability p(i)_{x[j]} of each letter x[j], j = 1, . . . , k, and multiply them together:

Pr(x = x1 x2 · · · xk | Ci ; θ) = ∏_{j=1}^{k} p(i)_{x[j]}
Since Σ_{a∈Σ} p(i)_a = 1, we have |Σ| − 1 free parameters for each of the two classes, and for each choice of parameters we get slightly different distributions over strings from the two classes. It is through the differences between p(1) and p(2) that we will be able to classify a string x.
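To make this concrete, here is a minimal sketch in Python (our own illustration, not part of the project code; the function name and the dictionary representation of p(i) are just one possible choice):

    def string_probability(x, p):
        """Pr(x | Ci ; theta) under the i.i.d. letter model, where p maps
        each letter a to its probability p_a (the values summing to 1)."""
        prob = 1.0
        for letter in x:
            prob *= p[letter]
        return prob

    # A made-up distribution over the DNA alphabet:
    p1 = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
    print(string_probability("ATTA", p1))  # 0.4 * 0.4 * 0.4 * 0.4 = 0.0256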
Now, whether this is a good model for our application depends a lot on what the real data looks like. It might not capture important structure in the real data. For instance, the assumption that the letter probability is independent of the index in the string might be incorrect (we will see a model where the distribution depends on the index in next week's lectures), or the letters might not be independent of each other (we will see an example of this when we work with hidden Markov models). Deciding whether you have made a good model is often a question of simulating data under your constructed model and comparing it with real data to see if there are large differences. If there are, you should improve your model to fit the data better, but quite often simple models are good enough for our application.
Of course, even if this model is a completely accurate model of the real data it doesn’t
mean that we are going to be able to easily classify strings. If you are equally likely to
see each letter in the alphabet whether a string comes from C1 or C2 then each class will
give roughly the same probability to each string x and this model will not be able to
distinguish them. Nevertheless, this is going to be our model for string classification.
2.3. Exercises. Write a function in your preferred programming language that simulates strings of length k given a vector of letter probabilities p, and another function that, given a string and a vector of letter probabilities, computes the probability of the string.
Use the simulator to simulate a string and compute the probability of that string both using the true p you used when simulating and some different probability vector p′.
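As a starting point, a simulator could look something like this sketch (again our own illustration, reusing the dictionary representation and string_probability from the sketch above):

    import random

    def simulate_string(k, p):
        """Simulate a string of length k with letters drawn i.i.d. from p."""
        letters = list(p.keys())
        weights = [p[a] for a in letters]
        return "".join(random.choices(letters, weights=weights, k=k))

    # Simulate under the true p and compare the probabilities under p and p':
    p = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
    p_prime = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    x = simulate_string(20, p)
    print(string_probability(x, p), string_probability(x, p_prime))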
You can measure how far p′ is from p using the Kullback-Leibler divergence

D_KL(p′ ‖ p) = Σ_{a∈Σ} p′_a · log(p′_a / p_a)
(http://en.wikipedia.org/wiki/KullbackLeibler_divergence). If you plot this distance between p and p′ against the ratio of the probability of x under the two models,

Pr(x ; p)/Pr(x ; p′)

what happens?[1] (You might want to try this with a number of different simulated strings, since choosing random strings gives different results each time.)
Intuitively, you would expect longer strings to contain more information about the process that generated them than shorter strings do. What happens with Pr(x ; p) and Pr(x ; p′) when you simulate longer and longer strings? Try plotting Pr(x ; p′)/Pr(x ; p) against the length of x. Again, for each string length you might want to sample several strings to take stochastic variation into account.
If you simulate long strings you will probably quickly run into underflow problems. You avoid this if you compute the log-likelihood instead of the likelihood, i.e.

log Pr(x ; p) = Σ_{j=1}^{|x|} log p_{x[j]}

instead of

Pr(x ; p) = ∏_{j=1}^{|x|} p_{x[j]}

and if you do that you want to look at the difference

log Pr(x ; p′) − log Pr(x ; p)

instead of the ratio

Pr(x ; p′) / Pr(x ; p)
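In code, the log-likelihood and the Kullback-Leibler divergence above might look like this (a sketch under the same dictionary representation as before):

    import math

    def log_string_probability(x, p):
        """log Pr(x ; p): a sum of log letter probabilities, which does not
        underflow the way a product of many small numbers does."""
        return sum(math.log(p[a]) for a in x)

    def kl_divergence(p_prime, p):
        """D_KL(p' || p) = sum over a of p'_a * log(p'_a / p_a)."""
        return sum(p_prime[a] * math.log(p_prime[a] / p[a]) for a in p_prime)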
If instead of a single string x we had a set of strings D = {x1, x2, . . . , xn}, how would you write the probability of the set D coming from the distribution p? If you do the exercises above with a set of strings rather than a single string, what changes?

[1] The probability of x gets exponentially smaller as the length increases (do you see why?), so comparing different lengths can be difficult. The ratio here shows how probable x is under one model relative to the other, and while both numerator and denominator shrink exponentially, the ratio still tells you the relative support for one model compared to the other. This particular ratio is called the likelihood ratio, since it is just another way of writing lhd(p′ ; x)/lhd(p ; x). If we think of p and p′ as two different models, rather than two different parameter points, it is called the Bayes factor (http://en.wikipedia.org/wiki/Bayes_factor).
3. Classifying strings
Now, what we wanted to build was a model that, given a string x, would tell us if x came from C1 or C2. From the model of Pr(x | Ci ; θ) we developed above we therefore want to get a function x ↦ Pr(Ci | x ; θ) instead. If we know Pr(Ci | x ; θ) we would classify x as belonging to class Ci if Pr(Ci | x ; θ) is high enough. If we have two classes to choose from, this typically means that we would classify x as coming from C1 if Pr(C1 | x ; θ) > 0.5 and classify it as coming from C2 otherwise. We don't always have to classify x, though, and sometimes we might have an application where we should only
classify if we are relatively certain that we are right. This we also get from knowing
Pr(Ci | x ; θ) since we then simply require that the support for the chosen class is high
enough, and refrain from classifying strings where there is not high enough probability
for a single class.
We get the formula we want from Bayes' formula

Pr(B | A) = Pr(A | B) Pr(B) / Pr(A)

which for our example means

Pr(Ci | x ; θ) = Pr(x | Ci ; θ) Pr(Ci ; θ) / Pr(x ; θ)
which introduces two new probabilities: Pr(Ci ; θ) and Pr(x ; θ).
We can compute Pr(x ; θ) from the other two probabilities since
Pr(x ; θ) = Pr(x | C1 ; θ) · Pr(C1 ; θ) + Pr(x | C2 ; θ) · Pr(C2 ; θ)
assuming that there are only the two classes C1 and C2.[2] The other probability, Pr(Ci ; θ), we have to specify.
The probability Pr(Ci ; θ) is independent of x and can be thought of as how likely it is that any given string comes from that class in the first place. This can just be another parameter of our model, π, such that the set of parameters is now θ = (π, p(1), p(2)), where p(i), i = 1, 2, are the letter distributions for the two classes as before, and Pr(C1 ; θ) = π and Pr(C2 ; θ) = 1 − π.
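Putting Bayes' formula into code is straightforward; this sketch (our own naming) computes the posterior probability of C1 from the two class-conditional probabilities and the prior π, reusing string_probability from earlier:

    def posterior_C1(x, pi, p1, p2):
        """Pr(C1 | x ; theta) via Bayes' formula, with Pr(C1 ; theta) = pi and
        Pr(x ; theta) = Pr(x | C1) * pi + Pr(x | C2) * (1 - pi)."""
        lik1 = string_probability(x, p1)  # Pr(x | C1 ; theta)
        lik2 = string_probability(x, p2)  # Pr(x | C2 ; theta)
        return lik1 * pi / (lik1 * pi + lik2 * (1 - pi))

For long strings you would want to compute this from log-likelihoods instead, to avoid the underflow problems discussed in the exercises above.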
The parameter π is something we must set, either explicitly or by training from data as we will see in the next section. For now, let us just consider what the functions Pr(Ci ; θ) and Pr(x | Ci ; θ) tell us, and how they help us pick the right class for a string x.
The so-called prior probability, Pr(Ci ; θ), describes how likely we think it is that class Ci produces a string to begin with. If we think that C1 and C2 are equally likely to produce strings it doesn't matter so much when going from Pr(x | Ci ; θ) to Pr(Ci | x ; θ), but if we expect, for example, only one in a hundred strings to come from C1, we would need more evidence that a specific string, x, is likely to have come from C1 if we want to classify it as such.
The probability of the string given the class, Pr(x | Ci ; θ), on the other hand, tells us how likely it is that the class would produce the string x. For that reason we could call it the likelihood, although we have already used that term for the probability as a function of the parameters θ. Still, you might sometimes see it called the likelihood, and in most ways it behaves like a likelihood, where, as you recall, lhd(θ ; x) is just a way of writing Pr(x ; θ). If you think of Ci as a parameter of the model rather than a stochastic variable we condition on, you see the resemblance.[3] If Pr(x | C1 ; θ) ≫ Pr(x | C2 ; θ), that is, if C1 is much more likely to produce the string x than C2 is, then observing x weighs the
odds towards C1 rather than C2. So even if we a priori thought that we would only see a string from C1 once in a hundred times, if Pr(x | C1 ; θ) is a thousand times higher than Pr(x | C2 ; θ), then observing x would still make it more likely that it came from C1.
It is by combining the prior probability of seeing the class Ci with how likely it is to
produce the string we observe that we get the posterior probability of Ci : Pr(Ci | x ; θ).
We often write this intuition in the following form:

Pr(C1 | x ; θ) / Pr(C2 | x ; θ) = [Pr(x | C1 ; θ) / Pr(x | C2 ; θ)] × [Pr(C1 ; θ) / Pr(C2 ; θ)]

and you can think of Pr(C1 ; θ)/Pr(C2 ; θ) as the prior odds, that is, the odds of seeing something from C1 rather than C2 to begin with, and of Pr(C1 | x ; θ)/Pr(C2 | x ; θ) as the posterior odds, that is, the odds that the x you saw came from C1 rather than C2.
If C1 is unlikely to begin with, the prior odds are small. However, if we then observe a string that C1 is very likely to produce and C2 is unlikely to produce, the odds change. The stronger the prior odds are against C1, the more evidence we demand to see before we select C1 over C2.
Since

Pr(C1 | x ; θ) / Pr(C2 | x ; θ) > 1 ⇔ Pr(C1 | x ; θ) > Pr(C2 | x ; θ)

we would classify x as coming from C1 if the posterior odds are higher than 1 (or sufficiently higher than 1 if we want to avoid less certain cases) and classify it as coming from C2 if the posterior odds are below 1. If the posterior odds are exactly 1 it is probably best not to make a decision.

[2] This follows from how we calculate with probabilities. We can marginalise over some of the variables in a joint distribution, so Pr(A) = Σ_i Pr(A, Bi), and by definition of conditional distributions Pr(A, B) = Pr(A | B) · Pr(B).

[3] I have been careful to distinguish between conditional probabilities, Pr(A | B), and parameterised distributions, Pr(A ; θ), but in all the arithmetic we do there really isn't much of a difference. A conditional probability is just a parameterised distribution, and the only difference between having a conditional distribution and having a parameterised distribution is whether we think parameters can be thought of as stochastic and having a distribution or not. This philosophical distinction is the difference between Bayesian and Frequentist statistics.
3.1. Exercises. Pick two letter distributions, p and p′, and simulate n strings of length k from each. Classify a string x as class C1 if Pr(C1 | x ; θ) > 0.5, assuming equal priors (which amounts to checking whether Pr(x | C1 ; θ) > Pr(x | C2 ; θ)), and as C2 otherwise, and measure how well you do (how many strings you assign to the right class divided by the number of strings, 2n). How well do you classify as a function of how far p is from p′? How well do you classify as a function of the length of the strings?
Now simulate strings by first randomly choosing p or p′, so that you choose p with some probability π. Classify the strings both as above and by using their posterior odds (or posterior probabilities, whichever you prefer; it gives you the same result). Compare the accuracy of the classification when the prior probabilities / prior odds are taken into account versus when they are not. Plot the accuracy with both approaches as a function of π.
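One possible skeleton for the second experiment (our own sketch, reusing simulate_string and posterior_C1 from earlier) is:

    def classification_accuracy(n, k, pi, p1, p2):
        """Simulate n strings from the two-class mixture and report the
        fraction that posterior classification assigns to the right class."""
        correct = 0
        for _ in range(n):
            from_C1 = random.random() < pi  # pick the class with prior pi
            x = simulate_string(k, p1 if from_C1 else p2)
            predict_C1 = posterior_C1(x, pi, p1, p2) > 0.5
            correct += (predict_C1 == from_C1)
        return correct / n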
4. Training the string classifier
Finally we come to training the model, that is, setting the parameters of the model θ = (π, p(1), p(2)). We of course want to choose the parameters in such a way that we maximise the probability of correctly classifying a new string that might show up. We don't know which strings we are likely to see, however, nor which classes they come from, and until we actually have a set of parameters we cannot even make educated guesses about it. Just saying that we want to optimise how well we can do in the future is thus
not something we can tell a computer to do; we need some algorithm for setting the
parameters, such that at least we are likely to get a good classifier for future data.
One can show as a general result that if you have the right model, then on average you cannot do better than using the true parameters. That is, the best average performance you can achieve on new data is obtained by using the true parameters of the model. If we
don’t really have the right model all bets are off, really, and unfortunately this is almost
always the case. It’s a bit of a fatalistic thought, though, so we are going to assume that
we have the right model (and if not we always go back to modelling to at least get one
that is as close as possible), because if we have the right model we have some general
approaches to estimating the true parameters.
If we do not have any data to work with, we can do nothing but guess at the parameters, but typically we can get a data set D = {(x1, t1), (x2, t2), . . . , (xn, tn)} of data points xj and "targets" tj; in our application, strings xj ∈ Σ∗ and associated classes tj ∈ {C1, C2}. From this data we need to set the parameters. There are three approaches that are frequently used, and many machine learning algorithms are just concrete instances of one of these general approaches. They are not always the best choice, but they are always a good choice, and unless you can show that an alternative does better you should use one of these. The approaches are:
(1) you maximise the likelihood,
(2) you maximise the posterior,
(3) or you make predictions using a posterior distribution,
see equations (1), (2) and (3) below.
The first is a Frequentist approach (http://en.wikipedia.org/wiki/Frequentist_inference) while the second and third are Bayesian (http://en.wikipedia.org/wiki/Bayesian_inference). We will only use the first two in this class but mention the third in case you run into it in the future.
4.1. Maximum likelihood estimates. Maximum likelihood estimation (http://en.wikipedia.org/wiki/Maximum_likelihood), as the name suggests, is based on maximising the likelihood function lhd(θ ; D) = Pr(D ; θ) with respect to the parameters θ:

(1) θ̂_MLE = argmax_θ Pr(D ; θ)
This you do in whatever fashion you can, just as with maximising any other function. Sometimes you can do this analytically by setting the derivative to zero, ∂/∂θ lhd(θ ; D) = 0 (when θ is a vector you set the gradient to zero, ∇lhd = 0), and you will more often work with the log-likelihood, since it is usually easier to take the derivative of. Sometimes there are constraints on what values θ can legally take, which complicates this slightly, and sometimes we simply cannot solve the problem analytically. So, not surprisingly, there are algorithms and heuristics in the literature for maximising this function for specific machine learning methods; the Baum-Welch algorithm you will see for hidden Markov models is one such algorithm, and an instance of a general class of optimisation algorithms called Expectation-Maximisation (EM). EM is a numerical optimisation that is guaranteed to find a local, but not necessarily a global, maximum. Quite often you have to use heuristics and numerical algorithms to optimise the likelihood.
Intuitively, maximising Pr(D ; θ) is a sensible thing to do; you are picking the parameters that make the data you have observed most likely. There are also more theoretical properties of the maximum likelihood estimates that make them a good choice, not least that they are guaranteed to converge towards the real parameters as the number of data points grows. They can often be biased, meaning that on average they slightly over- or under-estimate the true parameters, but this bias is guaranteed to get smaller and smaller as the number of data points grows. Still, in many algorithms you will see estimators that correct for the bias in the maximum likelihood estimator to get an unbiased estimate that still converges to the true value. We won't worry about this and just maximise likelihoods (and hope that we have enough data and are sufficiently converged to the true value that we needn't worry about the bias).
4.2. Bayesian estimates. For Bayesian estimation we treat the parameter not just as an unknown value but as a stochastic one with its own distribution. So rather than having a likelihood lhd(θ ; D) = Pr(D ; θ) we have a conditional distribution lhd(θ | D) = Pr(D | θ). From this we can get a posterior distribution over parameters, Pr(θ | D),[4] and we can use this in two different ways in our classification model.
With a posterior distribution over parameters it makes more sense to choose the parameters with the maximal probability rather than the maximal likelihood, and we call this estimator the maximum a posteriori estimator:

(2) θ̂_MAP = argmax_θ Pr(θ | D) = argmax_θ Pr(D | θ) Pr(θ)

where in the last equality we use Bayes' rule and ignore the division by Pr(D), since it is a constant when optimising with respect to θ.
With a Bayesian approach, however, you do not need to maximise the posterior.
You have a distribution of parameters and by integrating over all possible parameters,
weighted by their probability, you can make predictions as well. So if you need to make
predictions for a data point x, rather than using a single estimated parameter θ̂ (whether
maximum likelihood estimate or maximum a posteriori estimate) and the probability
Pr(x ; θ̂) you can get the probability of x given all the previous data, Pr(x | D) using
(3) Pr(x | D) = ∫ Pr(x | θ) Pr(θ | D) dθ = ∫ Pr(x | θ) · (Pr(D | θ) Pr(θ) / Pr(D)) dθ
A main benefit of using Bayesian approaches is that we can alleviate some of the
problems we have with the stochastic variation of estimates when we have very little
data. The maximum likelihood estimates will converge to the true parameters, but
when there is little data the estimate can be far from the truth just by random chance.
Imagine flipping a coin to estimate the probability p of seeing heads rather than tails.[5] If you flip a coin n times and see h heads, the maximum likelihood estimate for p is p̂_MLE = h/n. As n → ∞ the estimate will go to the true value, p̂_MLE → p, but for small
n it can be quite far from the true value. For the extreme that n = 1 we have p̂ = 0 or
p̂ = 1 regardless of the true value of p.
Using a prior distribution over parameter values we can nudge the estimated parameters away from the extremes, and if we have some idea about what likely values are going to look like, we can capture this with the prior distribution. In the coin toss example, we can use the prior to make it more likely that p is around 0.5 if we a priori believe that the coin is unbiased.
Using the full posterior probability to make predictions, as in (3), rather than just the most likely value, as in (2), captures how certain you are about the parameter estimate. If you use the θ̂_MAP estimator for prediction, you predict with the same certainty regardless of how concentrated the posterior probability is around the maximum point θ̂_MAP. Using (3) is better in the sense that it takes into account all the knowledge about the parameters that you have learned from the data. It is not always easy to get a computationally fast model where you can do this, though, so we won't see more of it in this class.

[4] Strictly speaking, if θ is continuous then we have a density for it rather than a probability, but I am not going to bother making the distinction in these notes; when you see me use the notation Pr(x) for a continuous variable x, just substitute a density f(x) if you prefer. Likewise, if I integrate ∫ f(x) dx and f(x) is discrete, read it as a sum over all values of x. If there is a risk of confusion when doing this, I will point it out, but there very rarely is.

[5] Coin flipping is a classical example because it is simple, but it is physically very hard to get a biased coin; see http://www.stat.columbia.edu/~gelman/research/published/diceRev2.pdf.
4.3. Conjugate priors. Taking a Bayesian approach means that we have two new probabilities, Pr(θ) and Pr(D). The latter is the probability of the data without conditioning on the parameters, something that might look odd since we have modelled the probability of data given parameters, but it comes from treating the parameters as stochastic and marginalising:

Pr(D) = ∫ Pr(D, θ) dθ = ∫ Pr(D | θ) Pr(θ) dθ

For maximising the posterior we don't need this probability, however, since it doesn't depend on the θ we maximise with respect to. For using the full posterior distribution (3) it is needed for normalisation, but in many cases we can avoid computing it explicitly through the integration, as we will see below.
The probability Pr(θ) is something we have to provide and not something we can
train from the data (it is independent of the data unlike the likelihood, after all). So it
becomes part of our modelling and it is up to our intuition and inventiveness to come
up with a good distribution.
A good choice is a so-called conjugate prior, a prior that makes it especially easy to combine prior distributions and likelihoods into posterior distributions (http://en.wikipedia.org/wiki/Conjugate_prior). The idea behind conjugate priors is that the prior is chosen such that both the prior and the posterior distribution are from the same parameterised family of functions f(− ; ξ), i.e. the difference between prior and posterior is the (meta-)parameter of this function: Pr(θ) = f(θ ; ξ0) and Pr(θ | D) = f(θ ; ξD). To use conjugate priors, you then simply need a function that combines the prior meta-parameter and the data into the posterior meta-parameter: g(ξ0, D) = ξD.
If we take coin-flipping again as an example, we can write the likelihood of the heads probability p from a single coin toss as

lhd(p | h) = Pr(h | p) = p^h · (1 − p)^(1−h)

where h is 1 if we observe heads and 0 if we observe tails. A series of independent coin tosses will just be a product of these for different outcomes, so n tosses with h heads and
t = n − h tails has the likelihood

lhd(p | h, t) = C(h+t, h) · p^h (1 − p)^t

where C(h+t, h) is the binomial coefficient "h+t choose h", needed to normalise the function as a probability over the number of heads and tails.
For a conjugate prior we want a function f(p ; ξ0) such that

p^h (1 − p)^t · f(p ; ξ0) ∝ f(p ; ξ_{h,t})

for some ξ_{h,t}. If we take something of the same form as the likelihood, with the meta-parameter playing a role similar to the heads and tails observations, we can define

f(p ; ξ0) = C_{α,β} · p^α (1 − p)^β

where ξ0 = (α, β) and C_{α,β} is the normalisation constant that makes this a distribution over p.[6]
Combining this prior with the likelihood, ignoring for now the normalising constants, we get

lhd(p | h, t) · f(p ; α, β) ∝ [p^h (1 − p)^t] · [p^α (1 − p)^β]
                            = p^(h+α) (1 − p)^(t+β)
                            ∝ f(p ; h + α, t + β)

where the last line comes from how we defined

f(p ; h + α, t + β) = C_{h+α,t+β} · p^(h+α) (1 − p)^(t+β) ∝ p^(h+α) (1 − p)^(t+β)

Since this shows that the posterior is proportional to f(p ; h + α, t + β), and since f(p ; h + α, t + β) by definition integrates to 1, as the posterior does by the property of being a density, they must be equal. So we move from prior to posterior by modifying the meta-parameters based on the observed data: g(ξ0, D) = g((α, β), (h, t)) = (α + h, β + t) = ξ_{h,t}. The new meta-parameters can be combined with more data if we get more, and in this way the distribution for p can be updated each time we observe more data, using this procedure again and again.
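In code the update is almost trivially simple. A sketch, following the notes' convention f(p ; α, β) ∝ p^α (1 − p)^β:

    def update_beta(alpha, beta, heads, tails):
        """g(xi_0, D): combine the prior meta-parameters (alpha, beta) with
        observed counts (heads, tails) into the posterior meta-parameters."""
        return alpha + heads, beta + tails

    def map_estimate(alpha, beta):
        """Mode of f(p ; alpha, beta), proportional to p^alpha * (1 - p)^beta,
        which is maximised at p = alpha / (alpha + beta)."""
        return alpha / (alpha + beta)

    # Two heads and one tail on top of a symmetric prior (3, 3):
    print(map_estimate(*update_beta(3, 3, 2, 1)))  # 5/9, approximately 0.56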
The conjugate prior for a given likelihood of course depends on the form of the likelihood, but most standard probability distributions have a corresponding conjugate that
you can look up if you need it. The conjugate prior for the coin toss above is called a
Beta distribution, Beta(α, β), and a way of thinking about the meta-parameters α and
β is as pseudo-counts of heads and tails, respectively. Using the prior, we pretend that
we have already observed α heads and β tails. If we set α = β we imply that we believe
that there is an equal chance of seeing heads and tails. The larger the numbers we use for α and β, the stronger we make the influence of the prior; if we pretend that we have seen 50 heads and 50 tails, observing 5 new tosses is not going to move our posterior distribution far away from 0.5. The higher h + t grows compared to α + β, the more the posterior probability is influenced by the observed data rather than the prior, and the posterior will look more and more like just the likelihood. Consequently, the maximum a posteriori estimator will converge to the same point as the maximum
likelihood estimator and thus shares the nice property that it converges to the true value. It is just potentially less sensitive to stochastic fluctuations due to little initial data.

[6] For f(p ; α, β) to be a density over p it is necessary that ∫₀¹ f(p ; α, β) dp = 1, which means that C_{α,β} = 1 / ∫₀¹ p^α (1 − p)^β dp.
4.4. Training the string classifier. Going back to our string classification problem, we want to estimate the parameters: π, the probability of seeing class C1 rather than C2, and the two letter distributions p(1) and p(2). The training data would be a set of strings paired with their classes, D = {(s1, c1), (s2, c2), . . . , (sn, cn)}.
If the training data is generated such that the probability of a string coming from C1 is actually π, we can estimate π the same way we estimated the probability of seeing heads rather than tails in a coin toss. The maximum likelihood estimator of π would be

π̂_MLE = n_{C1} / n

where n_{C1} is the number of ci from C1. With a Beta(α, β) prior for π we could instead use the maximum a posteriori estimate

π̂_MAP = (n_{C1} + α) / (n + α + β)

where again we see that α and β work as pseudo-counts for C1 and C2, respectively.
The strings paired with C1 are independent of the strings paired with C2, and p(1) depends only on the first set and p(2) only on the second. For convenience, and without loss of generality, we assume that s1, s2, . . . , sm are the strings paired with C1. The distribution p(1) is a multinomial distribution, and the maximum likelihood estimator is

p̂(1)_a = n(1)_a / L(1)

for all a ∈ Σ, where n(1)_a is the number of times a occurs in s1, . . . , sm and L(1) = Σ_{j=1}^{m} |sj| is the total length of the strings from C1. The distribution for the other class, p(2), is estimated in exactly the same way, but using the strings s(m+1), . . . , sn.
We can also add priors to multinomial distributions. The conjugate prior is called a Dirichlet distribution, but it works just like the pseudo-counts we have already seen. If we have a pseudo-count α_a for each a ∈ Σ and let α = Σ_{a∈Σ} α_a, then the MAP estimator would be

p̂(1)_a = (n(1)_a + α_a) / (L(1) + α)
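A sketch of the whole estimation step (our own illustration; the pseudo-count handling follows the MAP formulas above, with classes labelled 1 and 2, and setting all pseudo-counts to zero gives the maximum likelihood estimates instead):

    from collections import Counter

    def estimate_parameters(data, alpha=1, beta=1, pseudo=1):
        """MAP estimates of (pi, p1, p2) from labelled data [(string, class), ...]
        where class is 1 or 2; alpha, beta and pseudo are pseudo-counts."""
        n = len(data)
        n_C1 = sum(1 for _, c in data if c == 1)
        pi = (n_C1 + alpha) / (n + alpha + beta)

        alphabet = sorted(set("".join(s for s, _ in data)))

        def letter_distribution(strings):
            counts = Counter()
            for s in strings:
                counts.update(s)  # count every letter occurrence
            total = sum(counts.values()) + pseudo * len(alphabet)
            return {a: (counts[a] + pseudo) / total for a in alphabet}

        p1 = letter_distribution([s for s, c in data if c == 1])
        p2 = letter_distribution([s for s, c in data if c == 2])
        return pi, p1, p2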
4.5. Exercises. Use your simulator from earlier to simulate data sets of strings paired with the class that produced them. Estimate π from this data and plot the distance from your estimate, π̂, to the simulated value, π_sim: |π̂ − π_sim|, as a function of the number of strings you have simulated. You want to simulate several times for each size of the data set to see the stochastic variation from the simulations. Try this for several values of π_sim.
Try both the maximum likelihood estimator, π̂_MLE = n_{C1}/n, and a maximum a posteriori estimator, π̂_MAP = (n_{C1} + α)/(n + α + β). Try different values of α and β to see how this affects the estimation accuracy.
Simulate strings with a letter probability distribution p_sim and try to estimate this distribution. Try both the p̂_MLE and p̂_MAP estimators, with different pseudo-counts for the maximum a posteriori estimator. Plot the Kullback-Leibler divergence between the simulated and estimated distributions as a function of the total string length simulated.
Finally, put it all together so you can simulate a set of strings paired with classes, estimate all the parameters of this model, and then make predictions on the strings to test how well the predictions match the simulated values.
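A minimal end-to-end check, with made-up parameter values and reusing the sketches from the earlier exercises, could look like this:

    # Simulate labelled training data from known parameters ...
    pi_sim = 0.3
    p1_sim = {"A": 0.7, "B": 0.3}
    p2_sim = {"A": 0.2, "B": 0.8}
    train = []
    for _ in range(500):
        c = 1 if random.random() < pi_sim else 2
        train.append((simulate_string(50, p1_sim if c == 1 else p2_sim), c))

    # ... estimate the parameters from it ...
    pi_hat, p1_hat, p2_hat = estimate_parameters(train)

    # ... and measure prediction accuracy on fresh strings.
    correct = 0
    for _ in range(1000):
        c = 1 if random.random() < pi_sim else 2
        x = simulate_string(50, p1_sim if c == 1 else p2_sim)
        c_hat = 1 if posterior_C1(x, pi_hat, p1_hat, p2_hat) > 0.5 else 2
        correct += (c_hat == c)
    print("accuracy:", correct / 1000)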