3. Model-based Pattern Recognition
In this section we use Bayes Theorem in two different ways:
- to learn models of data
- to use the models to classify new data
Bayes Theorem and Classification
Now we return to the problem of how to assign an object to one of a set of
classes.
Here the hypotheses that we have to consider are
H_1: the object belongs to class 1
H_2: the object belongs to class 2
H_3: the object belongs to class 3
etc.
The data D consists of the feature values which describe the object.
So we must evaluate the following posterior probabilities:

P(H_1 | D) \propto P(D | H_1) P(H_1)
P(H_2 | D) \propto P(D | H_2) P(H_2)
P(H_3 | D) \propto P(D | H_3) P(H_3)
...
The priors tell us the prior probability that an object will be in each class. It
is possible that some classes may be more probable a priori than others.
And the likelihoods tell us how probable it is that an object from each class
would have those feature values.
For example, earlier we tried to assign an object X to one of two classes:
“man” or “woman”. In that case there was only one feature: the weight w of X.
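As a small illustration of this decision rule (not part of the original notes), here is a minimal Python sketch: for each class we multiply the likelihood by the prior and assign the object to the class with the largest product. The class names and numbers are invented purely for illustration.

```python
# Minimal sketch of Bayes classification: the posterior for each class is
# proportional to likelihood * prior, and we pick the class with the
# largest product.  All numbers here are invented for illustration.

priors = {"class 1": 0.5, "class 2": 0.3, "class 3": 0.2}          # P(H_i)
likelihoods = {"class 1": 0.02, "class 2": 0.10, "class 3": 0.01}  # P(D | H_i)

# Unnormalised posteriors: P(H_i | D) is proportional to P(D | H_i) * P(H_i)
posteriors = {c: likelihoods[c] * priors[c] for c in priors}

best = max(posteriors, key=posteriors.get)
print(best, posteriors[best])   # the class with the largest posterior
```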
Decision Surfaces
Once we have calculated all the posteriors we assign the object to the class
with the largest posterior.
In principle we could evaluate the posteriors for every point in feature space.
We could find the regions of feature space where each class has the largest
posterior. The boundaries between these regions are the decision surfaces.
The decision surface between class i and class j is given by the set of points
x where
P(H_i | x) = P(H_j | x)
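One way to see this in practice is to evaluate the (unnormalised) posteriors on a grid of points and record where the winning class changes. The following Python sketch does this for a one-dimensional feature with two hypothetical Gaussian class models; the means, standard deviations and priors are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

# Sketch: evaluate the unnormalised posteriors of two classes on a grid of
# points in a 1-D feature space and look for the places where the winning
# class changes.  Those change points approximate the decision surface
# P(H_i | x) = P(H_j | x).  All parameters are invented for illustration.
xs = np.linspace(0.0, 10.0, 2001)
post_1 = norm.pdf(xs, loc=4.0, scale=1.0) * 0.5   # P(x | H_1) * P(H_1)
post_2 = norm.pdf(xs, loc=6.0, scale=1.5) * 0.5   # P(x | H_2) * P(H_2)

winner = np.argmax(np.vstack([post_1, post_2]), axis=0)   # 0 or 1 at each x
boundaries = xs[np.where(np.diff(winner) != 0)[0]]
print(boundaries)   # approximate decision boundary locations
```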
Models
We assume we have a model for each class which tells us how probable any
set of features is for that class, i.e. it tells us P(D | H_i).

(Diagram: the features D are fed into the class model, which outputs P(D | H_i).)
We have already encountered various models in this course.
We had a model of how a millionaire behaves: he probably drives Porsches
and wears Armani suits.
We had a model of how drops of water fall from a leaky ceiling. We
assumed they fell randomly within a one meter interval.
We had a model of how a vase might be knocked over. We assumed that a
cat was more likely to knock one over than a fly.
All these models give the probability of a particular piece of data for a
particular hypothesis.
The Normal Distribution
One of the most commonly used models is the Normal Distribution.
This is a continuous probability density function. The general shape is
shown below.
(Diagram: a bell-shaped curve of P(x) against x, centred on the mean m, with its width set by the standard deviation σ.)
This models the probability distribution of a continuous variable. Many
quantities in the real world behave this way, especially quantities related to
natural objects, e.g. people’s weights and heights.
The Normal Distribution has two parameters, called the mean m and the
standard deviation σ.
The mean represents the centre of the distribution. It is the value of x with
the largest probability. The mean determines the position of the distribution
on the x axis. The diagram below shows two examples of different
distributions with different means but the same standard deviation.
(Diagram: two curves of P(x) against x with different means m1 and m2 but the same standard deviation.)
The standard deviation represents the width of the distribution. The diagram
below shows different distributions with different standard deviations but the
same mean. Notice that as the standard deviation increases the height
decreases.
(Diagram: two curves of P(x) against x with the same mean m but different standard deviations; the wider curve is lower.)
As with all continuous probability density functions, the area under the
curve is 1.
The Normal distribution is symmetric about the mean. So the area under the
curve on either side of the mean is 0.5.
The height of the curve drops as it gets further away from the mean. But it
never drops to zero: there is always some small, non-zero probability density,
no matter how far we are from the mean.
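The following short Python sketch (added for illustration) checks these properties numerically for an arbitrary choice of mean and standard deviation: the density peaks at the mean, the total area is 1, and half the area lies on each side of the mean.

```python
import numpy as np

def normal_pdf(x, m, sigma):
    """Normal density with mean m and standard deviation sigma."""
    return np.exp(-(x - m) ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

m, sigma = 70.0, 10.0                   # arbitrary illustrative values
xs = np.linspace(m - 8 * sigma, m + 8 * sigma, 100001)
p = normal_pdf(xs, m, sigma)
dx = xs[1] - xs[0]

print(xs[np.argmax(p)])                 # the peak is at the mean m
print(np.sum(p) * dx)                   # total area under the curve ~ 1
print(np.sum(p[xs <= m]) * dx)          # area on either side of the mean ~ 0.5
```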
How to use the Normal Model to Classify Objects
Let's assume that the features of our object are continuous variables which
obey the Normal distribution. In other words we will adopt the Normal
model for our data.
Here is an example which we have considered before.
Let’s try to decide whether an unknown person X is male or female given
the weight of X.
Here X is the object we are trying to classify.
We have two classes “male” and “female”.
We have one feature, the weight w of X.
We now have to construct two models – one for each class. Here we are
assuming that both classes can be modelled by a Normal distribution but the
parameters of the models will be different for each class.
In other words, we are assuming that the weights w of men obey a Normal
distribution with one pair of values for the mean and standard deviation and
the weights w of women also obey a Normal distribution but with a different
pair of values for the mean and standard deviation.
This is illustrated below where we plot the distribution for both classes on
the same diagram.
(Diagram: two Normal curves of P(w) against weight w; the women's curve is on the left and the men's curve is on the right.)
The right hand curve shows the probability that a man will have weight w.
The left hand curve shows the probability that a woman will have weight w.
The mean for the men is greater than the mean for women because men tend
to weigh more than women on average.
Let’s assume that we have accurate values for the means and standard
deviations. We could obtain these from some official Bureau of Statistics or
from medical journals.
Now we can put the above information into Bayes Theorem and decide
whether X is a man or a woman.
We have two hypotheses:
H_1: X is a man
H_2: X is a woman
Bayes Theorem tells us
P(H_1 | w) \propto P(w | H_1) P(H_1)
P(H_2 | w) \propto P(w | H_2) P(H_2)
Now in this case the two priors are equal. This is because the prior
probability that X is a man is equal to the prior probability that X is a
woman.
P(H_1) = P(H_2) = 1/2
Therefore the posterior probabilities are proportional to the likelihoods
P(H_1 | w) \propto P(w | H_1)
P(H_2 | w) \propto P(w | H_2)
But the likelihoods are determined by the Normal models which we
discussed above. P(w | H_1) is given by the right-hand curve which describes
the distribution of weights for men, and P(w | H_2) is given by the left-hand
curve which describes the weight distribution for women.
So we can now decide if X is a man or a woman.
If P(H_1 | w) > P(H_2 | w) then X is probably a man.
If P(H_2 | w) > P(H_1 | w) then X is probably a woman.
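Putting the pieces together, a minimal Python sketch of this classifier might look as follows. The means and standard deviations are invented for illustration; they are not taken from any official statistics.

```python
from scipy.stats import norm

# Sketch of the weight classifier.  The means and standard deviations below
# are invented for illustration; they are not official statistics.
m_man, s_man = 80.0, 10.0        # kg, Normal model for the class "man"
m_woman, s_woman = 65.0, 9.0     # kg, Normal model for the class "woman"
prior = 0.5                      # equal priors: P(H_1) = P(H_2) = 1/2

def classify(w):
    post_man = norm.pdf(w, loc=m_man, scale=s_man) * prior        # proportional to P(H_1 | w)
    post_woman = norm.pdf(w, loc=m_woman, scale=s_woman) * prior  # proportional to P(H_2 | w)
    return "man" if post_man > post_woman else "woman"

print(classify(85.0))   # well above the crossing point S, so "man"
print(classify(60.0))   # well below S, so "woman"
```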
Let's look again at our two models.

(Diagram: the two Normal curves of P(w) cross at a point S on the w axis; the region under the men's curve to the left of S is shaded.)
The point S marks the value of w where the two curves are equal.
For values of w greater than S, P(H_1 | w) > P(H_2 | w), so X is probably a man.
For values of w less than S, P(H_2 | w) > P(H_1 | w), so X is probably a woman.
So if we wish to tell if X is a man or a woman all we have to do is decide if
w > S or w < S.
Therefore S represents the decision boundary between the two classes.
In this case we have only one feature value. So the feature space is only
one-dimensional (i.e. it is a line). The decision boundary is just a point on
that line (i.e. the point S).
We can tell immediately that this classifier is not going to be 100% accurate.
The shaded region in the above diagram represents men with weights less
than S. These men are going to be misclassified as women.
Likewise women with weights greater than S are going to be misclassified as
men.
We should even be able to predict in advance what these misclassification
rates are going to be.
But the above classifier represents the best performance we can achieve if w
is the only information we have about X. If we want to improve performance
we are going to have to get some more information e.g. the height, waist
measurement etc.
How to calculate the Decision Boundary
The Normal distribution is described by the following equation
P(w) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(w - m)^2}{2\sigma^2}\right)

where m is the mean and σ is the standard deviation.
Now the two models which we used in the above example are determined by
two different pairs of values of m and σ. Let's call them m_1, σ_1 and m_2, σ_2.

Model 1 is given by

P(w | H_1) = \frac{1}{\sigma_1 \sqrt{2\pi}} \exp\left(-\frac{(w - m_1)^2}{2\sigma_1^2}\right)

Model 2 is given by

P(w | H_2) = \frac{1}{\sigma_2 \sqrt{2\pi}} \exp\left(-\frac{(w - m_2)^2}{2\sigma_2^2}\right)
The decision boundary S is the point where these two densities are equal.
If we take logarithms of both sides we get
\log(\sigma_1) + \frac{(S - m_1)^2}{2\sigma_1^2} = \log(\sigma_2) + \frac{(S - m_2)^2}{2\sigma_2^2}
And re-arranging the above equation gives S. (Notice that this equation is a
quadratic. Therefore there will be two solutions for S)
We get an especially simple formula for S if the two standard deviations are
equal.
S = \frac{m_1 + m_2}{2}
We can work out the error rates by looking up S in tables of the Normal
distribution.
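The same calculations can be done programmatically. Here is a sketch of the equal-standard-deviation case, using the Normal CDF from scipy.stats in place of printed tables; the parameters are illustrative only.

```python
from scipy.stats import norm

# Equal-standard-deviation case: the boundary is midway between the means,
# and the misclassification rates follow from the Normal CDF.
# The parameters below are illustrative only.
m1, m2, s = 80.0, 65.0, 10.0          # men's mean, women's mean, common sigma
S = (m1 + m2) / 2.0                   # decision boundary

p_man_as_woman = norm.cdf(S, loc=m1, scale=s)        # men with weight < S
p_woman_as_man = 1.0 - norm.cdf(S, loc=m2, scale=s)  # women with weight > S
print(S, p_man_as_woman, p_woman_as_man)
```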
In general there will be two decision boundaries. We can see this very
clearly if we consider two models where the standard deviations differ
greatly.
(Diagram: a broad curve and a narrow curve of P(x) against x; the two curves cross at two points S1 and S2.)
Here we have a very broad model and a very narrow model.
The decision region for the narrow model falls inside the decision region for
the broad model. So there are two decision boundaries S1 and S2.
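A sketch of this case: setting the two densities equal and rearranging gives a quadratic a S^2 + b S + c = 0, whose two roots are the boundaries S1 and S2. The broad and narrow model parameters below are made up for illustration.

```python
import numpy as np

# Sketch: with unequal standard deviations, setting the two Normal densities
# equal gives a quadratic in S, hence two boundaries S1 and S2.
# The broad/narrow parameters below are invented for illustration.
m1, s1 = 5.0, 3.0     # broad model
m2, s2 = 5.5, 0.8     # narrow model

# Coefficients of a*S**2 + b*S + c = 0, obtained by expanding
# log(s1) + (S - m1)**2 / (2*s1**2) = log(s2) + (S - m2)**2 / (2*s2**2)
a = 1.0 / (2 * s1**2) - 1.0 / (2 * s2**2)
b = m2 / s2**2 - m1 / s1**2
c = m1**2 / (2 * s1**2) - m2**2 / (2 * s2**2) + np.log(s1 / s2)

S1, S2 = sorted(np.roots([a, b, c]).real)
print(S1, S2)   # the narrow model wins between S1 and S2, the broad model outside
```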