3. Model-based Pattern Recognition
In this section we use Bayes Theorem in two different ways:
- to learn models of data
- to use the models to classify new data
Bayes Theorem and Classification
Now we return to the problem of how to assign an object to one of a set of
classes.
Here the hypotheses that we have to consider are
H_1: the object belongs to class 1
H_2: the object belongs to class 2
H_3: the object belongs to class 3
etc.
The data D consists of the feature values which describe the object.
So we must evaluate the following posterior probabilities:

P(H_1 | D) \propto P(D | H_1) P(H_1)
P(H_2 | D) \propto P(D | H_2) P(H_2)
P(H_3 | D) \propto P(D | H_3) P(H_3)
...
The priors tell us the prior probability that an object will be in each class. It
is possible that some classes may be more probable a priori than others.
And the likelihoods tell us how probable it is that an object from each class
would have those feature values.
For example, earlier we tried to assign an object X to one of two classes:
“man” or “woman”. In that case there was only one feature: the weight w of X.
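As a small illustration of this decision rule (not part of the original notes), here is a minimal Python sketch: for each class we multiply the likelihood by the prior and assign the object to the class with the largest product. The class names and numbers are invented purely for illustration.

```python
# Minimal sketch of Bayes classification: the posterior for each class is
# proportional to likelihood * prior, and we pick the class with the
# largest product.  All numbers here are invented for illustration.

priors = {"class 1": 0.5, "class 2": 0.3, "class 3": 0.2}          # P(H_i)
likelihoods = {"class 1": 0.02, "class 2": 0.10, "class 3": 0.01}  # P(D | H_i)

# Unnormalised posteriors: P(H_i | D) is proportional to P(D | H_i) * P(H_i)
posteriors = {c: likelihoods[c] * priors[c] for c in priors}

best = max(posteriors, key=posteriors.get)
print(best, posteriors[best])   # the class with the largest posterior
```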
Decision Surfaces
Once we have calculated all the posteriors we assign the object to the class
with the largest posterior.
In principle we could evaluate the posteriors for every point in feature space.
We could find the regions of feature space where each class has the largest
posterior. The boundaries between these regions are the decision surfaces.
The decision surface between class i and class j is given by the set of points
x where
P(H_i | x) = P(H_j | x)
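One way to see this in practice is to evaluate the (unnormalised) posteriors on a grid of points and record where the winning class changes. The following Python sketch does this for a one-dimensional feature with two hypothetical Gaussian class models; the means, standard deviations and priors are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

# Sketch: evaluate the unnormalised posteriors of two classes on a grid of
# points in a 1-D feature space and look for the places where the winning
# class changes.  Those change points approximate the decision surface
# P(H_i | x) = P(H_j | x).  All parameters are invented for illustration.
xs = np.linspace(0.0, 10.0, 2001)
post_1 = norm.pdf(xs, loc=4.0, scale=1.0) * 0.5   # P(x | H_1) * P(H_1)
post_2 = norm.pdf(xs, loc=6.0, scale=1.5) * 0.5   # P(x | H_2) * P(H_2)

winner = np.argmax(np.vstack([post_1, post_2]), axis=0)   # 0 or 1 at each x
boundaries = xs[np.where(np.diff(winner) != 0)[0]]
print(boundaries)   # approximate decision boundary locations
```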
Models
We assume we have a model for each class which tells us how probable any
set of features is for that class, i.e. it tells us P(D | H_i).

(Diagram: the features D are fed into the class model, which outputs P(D | H_i).)
We have already encountered various models in this course.
We had a model of how a millionaire behaves: he probably drives Porsches
and wears Armani suits.
We had a model of how drops of water fall from a leaky ceiling. We
assumed they fell randomly within a one meter interval.
We had a model of how a vase might be knocked over. We assumed that a
cat was more likely to knock one over than a fly.
All these models give the probability of a particular piece of data for a
particular hypothesis.
The Normal Distribution
One of the most commonly used models is the Normal Distribution.
This is a continuous probability density function. The general shape is
shown below.
(Diagram: a bell-shaped curve of P(x) against x, centred on the mean m, with its width set by the standard deviation σ.)
This models the probability distribution of a continuous variable. Many
quantities in the real world behave this way, especially quantities related to
natural objects, e.g. people’s weights and heights.
The Normal Distribution has two parameters, called the mean m and the
standard deviation σ.
The mean represents the centre of the distribution. It is the value of x with
the largest probability. The mean determines the position of the distribution
on the x axis. The diagram below shows two examples of different
distributions with different means but the same standard deviation.
(Diagram: two curves of P(x) against x with different means m1 and m2 but the same standard deviation.)
The standard deviation represents the width of the distribution. The diagram
below shows different distributions with different standard deviations but the
same mean. Notice that as the standard deviation increases the height
decreases.
(Diagram: two curves of P(x) against x with the same mean m but different standard deviations; the wider curve is lower.)
As with all continuous probability density functions, the area under the
curve is 1.
The Normal distribution is symmetric about the mean. So the area under the
curve on either side of the mean is 0.5.
The height of the curve drops as it gets further away from the mean. But it
never drops to zero: there is always some small, non-zero probability density,
no matter how far we are from the mean.
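The following short Python sketch (added for illustration) checks these properties numerically for an arbitrary choice of mean and standard deviation: the density peaks at the mean, the total area is 1, and half the area lies on each side of the mean.

```python
import numpy as np

def normal_pdf(x, m, sigma):
    """Normal density with mean m and standard deviation sigma."""
    return np.exp(-(x - m) ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

m, sigma = 70.0, 10.0                   # arbitrary illustrative values
xs = np.linspace(m - 8 * sigma, m + 8 * sigma, 100001)
p = normal_pdf(xs, m, sigma)
dx = xs[1] - xs[0]

print(xs[np.argmax(p)])                 # the peak is at the mean m
print(np.sum(p) * dx)                   # total area under the curve ~ 1
print(np.sum(p[xs <= m]) * dx)          # area on either side of the mean ~ 0.5
```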
How to use the Normal Model to Classify Objects
Let's assume that the features of our object are continuous variables which
obey the Normal distribution. In other words we will adopt the Normal
model for our data.
Here is an example which we have considered before.
Let’s try to decide whether an unknown person X is male or female given
the weight of X.
Here X is the object we are trying to classify.
We have two classes “male” and “female”.
We have one feature, the weight w of X.
We now have to construct two models – one for each class. Here we are
assuming that both classes can be modelled by a Normal distribution but the
parameters of the models will be different for each class.
In other words, we are assuming that the weights w of men obey a Normal
distribution with one pair of values for the mean and standard deviation and
the weights w of women also obey a Normal distribution but with a different
pair of values for the mean and standard deviation.
This is illustrated below where we plot the distribution for both classes on
the same diagram.
(Diagram: two Normal curves of P(w) against weight w; the women's curve is on the left and the men's curve is on the right.)
The right hand curve shows the probability that a man will have weight w.
The left hand curve shows the probability that a woman will have weight w.
The mean for the men is greater than the mean for women because men tend
to weigh more than women on average.
Let’s assume that we have accurate values for the means and standard
deviations. We could obtain these from some official Bureau of Statistics or
from medical journals.
Now we can put the above information into Bayes Theorem and decide
whether X is a man or a woman.
We have two hypotheses:
H_1: X is a man
H_2: X is a woman
Bayes Theorem tells us
P(H_1 | w) \propto P(w | H_1) P(H_1)
P(H_2 | w) \propto P(w | H_2) P(H_2)
Now in this case the two priors are equal. This is because the prior
probability that X is a man is equal to the prior probability that X is a
woman.
P(H_1) = P(H_2) = 1/2
Therefore the posterior probabilities are proportional to the likelihoods
P(H_1 | w) \propto P(w | H_1)
P(H_2 | w) \propto P(w | H_2)
But the likelihoods are determined by the Normal models which we
discussed above. P(w | H_1) is given by the right-hand curve which describes
the distribution of weights for men, and P(w | H_2) is given by the left-hand
curve which describes the weight distribution for women.
So we can now decide if X is a man or a woman.
If P(H_1 | w) > P(H_2 | w) then X is probably a man.
If P(H_2 | w) > P(H_1 | w) then X is probably a woman.
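Putting the pieces together, a minimal Python sketch of this classifier might look as follows. The means and standard deviations are invented for illustration; they are not taken from any official statistics.

```python
from scipy.stats import norm

# Sketch of the weight classifier.  The means and standard deviations below
# are invented for illustration; they are not official statistics.
m_man, s_man = 80.0, 10.0        # kg, Normal model for the class "man"
m_woman, s_woman = 65.0, 9.0     # kg, Normal model for the class "woman"
prior = 0.5                      # equal priors: P(H_1) = P(H_2) = 1/2

def classify(w):
    post_man = norm.pdf(w, loc=m_man, scale=s_man) * prior        # proportional to P(H_1 | w)
    post_woman = norm.pdf(w, loc=m_woman, scale=s_woman) * prior  # proportional to P(H_2 | w)
    return "man" if post_man > post_woman else "woman"

print(classify(85.0))   # well above the crossing point S, so "man"
print(classify(60.0))   # well below S, so "woman"
```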
Let's look again at our two models.

(Diagram: the two Normal curves of P(w) cross at a point S on the w axis; the region under the men's curve to the left of S is shaded.)
The point S marks the value of w where the two curves are equal.
For values of w greater than S, P(H_1 | w) > P(H_2 | w), so X is probably a man.
For values of w less than S, P(H_2 | w) > P(H_1 | w), so X is probably a woman.
So if we wish to tell if X is a man or a woman all we have to do is decide if
w > S or w < S.
Therefore S represents the decision boundary between the two classes.
In this case we have only one feature value. So the feature space is only
one-dimensional (i.e. it is a line). The decision boundary is just a point on
that line (i.e. the point S).
We can tell immediately that this classifier is not going to be 100% accurate.
The shaded region in the above diagram represents men with weights less
than S. These men are going to be misclassified as women.
Likewise women with weights greater than S are going to be misclassified as
men.
We should even be able to predict in advance what these misclassification
rates are going to be.
But the above classifier represents the best performance we can achieve if w
is the only information we have about X. If we want to improve performance
we are going to have to get some more information e.g. the height, waist
measurement etc.
How to calculate the Decision Boundary
The Normal distribution is described by the following equation
P(w) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(w - m)^2}{2\sigma^2}\right)

where m is the mean and σ is the standard deviation.
Now the two models which we used in the above example are determined by
two different pairs of values of m and σ. Let's call them m_1, σ_1 and m_2, σ_2.

Model 1 is given by

P(w | H_1) = \frac{1}{\sigma_1 \sqrt{2\pi}} \exp\left(-\frac{(w - m_1)^2}{2\sigma_1^2}\right)

Model 2 is given by

P(w | H_2) = \frac{1}{\sigma_2 \sqrt{2\pi}} \exp\left(-\frac{(w - m_2)^2}{2\sigma_2^2}\right)
The decision boundary S is the point where these two densities are equal.
If we take logarithms of both sides we get
\log(\sigma_1) + \frac{(S - m_1)^2}{2\sigma_1^2} = \log(\sigma_2) + \frac{(S - m_2)^2}{2\sigma_2^2}
And re-arranging the above equation gives S. (Notice that this equation is a
quadratic. Therefore there will be two solutions for S)
We get an especially simple formula for S if the two standard deviations are
equal.
S = \frac{m_1 + m_2}{2}
We can work out the error rates by looking up S in tables of the Normal
distribution.
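The same calculations can be done programmatically. Here is a sketch of the equal-standard-deviation case, using the Normal CDF from scipy.stats in place of printed tables; the parameters are illustrative only.

```python
from scipy.stats import norm

# Equal-standard-deviation case: the boundary is midway between the means,
# and the misclassification rates follow from the Normal CDF.
# The parameters below are illustrative only.
m1, m2, s = 80.0, 65.0, 10.0          # men's mean, women's mean, common sigma
S = (m1 + m2) / 2.0                   # decision boundary

p_man_as_woman = norm.cdf(S, loc=m1, scale=s)        # men with weight < S
p_woman_as_man = 1.0 - norm.cdf(S, loc=m2, scale=s)  # women with weight > S
print(S, p_man_as_woman, p_woman_as_man)
```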
In general there will be two decision boundaries. We can see this very
clearly if we consider two models where the standard deviations differ
greatly.
(Diagram: a broad curve and a narrow curve of P(x) against x; the two curves cross at two points S1 and S2.)
Here we have a very broad model and a very narrow model.
The decision region for the narrow model falls inside the decision region for
the broad model. So there are two decision boundaries S1 and S2.
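A sketch of this case: setting the two densities equal and rearranging gives a quadratic a S^2 + b S + c = 0, whose two roots are the boundaries S1 and S2. The broad and narrow model parameters below are made up for illustration.

```python
import numpy as np

# Sketch: with unequal standard deviations, setting the two Normal densities
# equal gives a quadratic in S, hence two boundaries S1 and S2.
# The broad/narrow parameters below are invented for illustration.
m1, s1 = 5.0, 3.0     # broad model
m2, s2 = 5.5, 0.8     # narrow model

# Coefficients of a*S**2 + b*S + c = 0, obtained by expanding
# log(s1) + (S - m1)**2 / (2*s1**2) = log(s2) + (S - m2)**2 / (2*s2**2)
a = 1.0 / (2 * s1**2) - 1.0 / (2 * s2**2)
b = m2 / s2**2 - m1 / s1**2
c = m1**2 / (2 * s1**2) - m2**2 / (2 * s2**2) + np.log(s1 / s2)

S1, S2 = sorted(np.roots([a, b, c]).real)
print(S1, S2)   # the narrow model wins between S1 and S2, the broad model outside
```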