Pattern Classification
All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
Bayes Decision Theory
Feb. 17
Classification
Supervised learning: based on training examples (E), learn a model which performs well on previously unseen examples.
Classification: a supervised learning task of categorising entities into a predefined set of classes.
Posterior, likelihood, evidence
P(ωj | x) = P(x | ωj) · P(ωj) / P(x)
Posterior = (Likelihood · Prior) / Evidence
where, in the case of two categories,
P(x) = ∑_{j=1}^{2} P(x | ωj) P(ωj)
Bayesian Decision
• Decision given the posterior probabilities
x is an observation for which:
if P(ω1 | x) > P(ω2 | x), decide the true state of nature is ω1
if P(ω1 | x) < P(ω2 | x), decide the true state of nature is ω2
This rule minimizes the probability of error.
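As an illustration, here is a minimal Python sketch of this minimum-error decision: posteriors are obtained from Bayes' formula and the class with the larger posterior is chosen. The likelihood and prior values are made-up placeholders, not numbers from these slides.

```python
# Minimal sketch of the minimum-error Bayes decision for two categories.
# The likelihood and prior values below are illustrative placeholders.

def posteriors(likelihoods, priors):
    """Return P(w_j | x) for all classes via Bayes' formula."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))  # P(x)
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Example: P(x | w1) = 0.6, P(x | w2) = 0.3, priors 0.5 / 0.5 (made up)
post = posteriors([0.6, 0.3], [0.5, 0.5])
decision = 1 if post[0] > post[1] else 2   # decide w1 if P(w1|x) > P(w2|x)
print(post, "-> decide w%d" % decision)
```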
Bayesian Decision Theory – Generalization
• Use of more than one feature
• Use of more than two states of nature
• Allowing actions, not only deciding on the state of nature
• Introduce a loss function which is more general than the probability of error
Let {ω1, ω2, …, ωc} be the set of c states of nature (or "categories")
Let {α1, α2, …, αa} be the set of possible actions
Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj
Bayes decision theory example
Automatic trading (on stock exchanges)
ω1: the prices will increase (in the future!)
ω2: the prices will be lower
ω3: the prices won't change much
We cannot observe ω (it is latent)!
α1: buy
α2: sell
x: actual prices (and historical prices); x is observed
λ: how much we lose with an action
Overall risk
R = sum of all R(αi | x) for i = 1, …, a
Conditional risk:
R(αi | x) = ∑_{j=1}^{c} λ(αi | ωj) P(ωj | x)   for i = 1, …, a
Minimizing R amounts to minimizing R(αi | x) for i = 1, …, a
Select the action αi for which R(αi | x) is minimum.
Then the overall risk R is minimized, and R in this case is called the
Bayes risk = the best performance that can be achieved!
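A minimal sketch of this risk-minimizing choice, assuming a hypothetical 2×2 loss matrix λ and made-up posteriors:

```python
# Sketch: choose the action a_i that minimizes the conditional risk
# R(a_i | x) = sum_j lambda(a_i | w_j) * P(w_j | x).
# The loss matrix and posteriors below are illustrative placeholders.

loss = [[0.0, 2.0],   # lambda(a1 | w1), lambda(a1 | w2)
        [1.0, 0.0]]   # lambda(a2 | w1), lambda(a2 | w2)
post = [0.7, 0.3]     # P(w1 | x), P(w2 | x)

risks = [sum(l * p for l, p in zip(row, post)) for row in loss]
best_action = min(range(len(risks)), key=risks.__getitem__) + 1
print("conditional risks:", risks, "-> take action a%d" % best_action)
```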
• Two-category classification
α1: deciding ω1
α2: deciding ω2
λij = λ(αi | ωj): loss incurred for deciding ωi when the true state of nature is ωj
Conditional risk:
R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)
Our rule is the following:
if R(α1 | x) < R(α2 | x), then action α1 ("decide ω1") is taken
This results in the equivalent rule:
decide ω1 if
(λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)
and decide ω2 otherwise
Likelihood ratio:
The preceding rule is equivalent to the following rule:
if  P(x | ω1) / P(x | ω2)  >  [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]
then take action α1 (decide ω1);
otherwise take action α2 (decide ω2)
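The likelihood-ratio test can be sketched as below; the loss matrix, likelihood values and priors are illustrative assumptions, not data from the slides.

```python
# Sketch of the likelihood-ratio rule for two categories.
# Assumes lambda_21 > lambda_11 (an error costs more than a correct decision).
# All numeric values are illustrative placeholders.

def decide_by_likelihood_ratio(px_w1, px_w2, prior1, prior2, loss):
    """Return 1 (decide w1) or 2 (decide w2) using the likelihood-ratio test."""
    ratio = px_w1 / px_w2
    threshold = ((loss[0][1] - loss[1][1]) / (loss[1][0] - loss[0][0])) * (prior2 / prior1)
    return 1 if ratio > threshold else 2

loss = [[0.0, 2.0],  # lambda_11, lambda_12
        [1.0, 0.0]]  # lambda_21, lambda_22
print(decide_by_likelihood_ratio(0.6, 0.3, 2/3, 1/3, loss))
```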
Exercise
Select the optimal decision where:
Ω = {ω1, ω2}
P(x | ω1) ~ N(2, 0.5) (Normal distribution)
P(x | ω2) ~ N(1.5, 0.2)
P(ω1) = 2/3
P(ω2) = 1/3
λ = | 1  2 |
    | 3  4 |
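A possible numerical treatment of this exercise is sketched below, assuming the second argument of N(·, ·) is a standard deviation (the slide does not say whether it is a standard deviation or a variance) and using the loss matrix λ as reconstructed above.

```python
# Sketch of solving the exercise numerically for a given observation x.
# Assumption: N(2, 0.5) and N(1.5, 0.2) are read as N(mean, std).
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)

priors = [2/3, 1/3]
loss = [[1.0, 2.0],   # lambda(a1 | w1), lambda(a1 | w2)
        [3.0, 4.0]]   # lambda(a2 | w1), lambda(a2 | w2)

x = 1.8  # example observation (made up)
likelihoods = [normal_pdf(x, 2.0, 0.5), normal_pdf(x, 1.5, 0.2)]
evidence = sum(l * p for l, p in zip(likelihoods, priors))
post = [l * p / evidence for l, p in zip(likelihoods, priors)]
risks = [sum(l * p for l, p in zip(row, post)) for row in loss]
print("posteriors:", post, "risks:", risks,
      "-> action a%d" % (1 + risks.index(min(risks))))
```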
Zero-one loss function (Bayes Classifier)
λ(αi, ωj) = 0 if i = j, and 1 if i ≠ j,   for i, j = 1, …, c
Therefore, the conditional risk is:
R(αi | x) = ∑_{j=1}^{c} λ(αi | ωj) P(ωj | x) = ∑_{j≠i} P(ωj | x) = 1 − P(ωi | x)
"The risk corresponding to this loss function is the average probability of error"
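The following small check illustrates this equivalence: under the zero-one loss the conditional risks equal 1 − P(ωi | x), so the minimum-risk class is the maximum-posterior class. The posterior values are made up.

```python
# Sketch: with the zero-one loss, R(a_i | x) = 1 - P(w_i | x), so minimizing
# the risk is the same as maximizing the posterior. Posteriors are placeholders.

post = [0.2, 0.5, 0.3]                      # P(w_j | x), made up
loss = [[0 if i == j else 1 for j in range(3)] for i in range(3)]
risks = [sum(loss[i][j] * post[j] for j in range(3)) for i in range(3)]

assert all(abs(r - (1 - p)) < 1e-12 for r, p in zip(risks, post))
print("min-risk class:", 1 + risks.index(min(risks)),
      "= max-posterior class:", 1 + post.index(max(post)))
```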
Classifiers, Discriminant Functions and Decision Surfaces
• The multi-category case
• Set of discriminant functions gi(x), i = 1, …, c
• The classifier assigns a feature vector x to class ωi if:
  gi(x) > gj(x)  for all j ≠ i
• Let gi(x) = −R(αi | x)
  (max. discriminant corresponds to min. risk!)
• For the minimum error rate, we take gi(x) = P(ωi | x)
  (max. discriminant corresponds to max. posterior!)
  gi(x) = P(x | ωi) P(ωi)
  gi(x) = ln P(x | ωi) + ln P(ωi)
  (ln: natural logarithm!)
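A sketch of a multi-class decision with gi(x) = ln P(x | ωi) + ln P(ωi), assuming illustrative univariate Gaussian class-conditional densities (the parameter values are not from the slides):

```python
# Sketch of a multi-class decision with g_i(x) = ln P(x | w_i) + ln P(w_i).
# The class-conditional densities are illustrative univariate Gaussians.
from math import log, exp, pi, sqrt

def gaussian(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)

means, sigmas, priors = [0.0, 1.0, 3.0], [1.0, 0.5, 1.5], [0.5, 0.3, 0.2]

def classify(x):
    g = [log(gaussian(x, m, s)) + log(p) for m, s, p in zip(means, sigmas, priors)]
    return 1 + g.index(max(g))          # class with the largest discriminant

print(classify(0.8))
```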
• Feature space divided into c decision regions:
  if gi(x) > gj(x) for all j ≠ i, then x is in Ri
  (Ri means: assign x to ωi)
• The two-category case
  • A classifier is a "dichotomizer" that has two discriminant functions g1 and g2
  Let g(x) ≡ g1(x) − g2(x)
  Decide ω1 if g(x) > 0; otherwise decide ω2
• The computation of g(x):
  g(x) = P(ω1 | x) − P(ω2 | x)
  g(x) = ln [P(x | ω1) / P(x | ω2)] + ln [P(ω1) / P(ω2)]
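A minimal dichotomizer sketch using the log-ratio form of g(x); the two Gaussian class-conditional densities and the priors are illustrative assumptions.

```python
# Sketch of the dichotomizer g(x) = ln[P(x|w1)/P(x|w2)] + ln[P(w1)/P(w2)].
# The class-conditional densities and priors are illustrative placeholders.
from math import log, exp, pi, sqrt

def gaussian(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)

def g(x, prior1=0.5, prior2=0.5):
    return (log(gaussian(x, 0.0, 1.0)) - log(gaussian(x, 1.0, 1.5))
            + log(prior1) - log(prior2))

x = 0.4
print("decide w1" if g(x) > 0 else "decide w2")
```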
Discriminant Functions of the Bayes Classifier with Normal Density
The Normal Density
• Univariate density
  • Density which is analytically tractable
  • Continuous density
  • A lot of processes are asymptotically Gaussian
  • Handwritten characters and speech sounds are ideal or prototype patterns corrupted by a random process (central limit theorem)

  P(x) = [1 / (√(2π) σ)] · exp( −(1/2) ((x − μ) / σ)² )

where:
  μ = mean (or expected value) of x
  σ² = expected squared deviation, or variance
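A small sketch of this density in Python, with a crude Riemann-sum check that it integrates to 1; the values of μ and σ are arbitrary.

```python
# Sketch of the univariate normal density
# P(x) = (1 / (sqrt(2*pi) * sigma)) * exp(-0.5 * ((x - mu) / sigma)**2),
# with a crude numerical check that it integrates to 1.
from math import exp, pi, sqrt

def normal_pdf(x, mu=2.0, sigma=0.5):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)

# Riemann-sum check over a wide interval around the mean.
step = 1e-3
total = sum(normal_pdf(-3.0 + k * step) * step for k in range(int(10.0 / step)))
print(round(total, 4))   # close to 1.0
```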
• Multivariate density
  • The multivariate normal density in d dimensions is:

  P(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] · exp( −(1/2) (x − μ)^t Σ^-1 (x − μ) )

where:
  x = (x1, x2, …, xd)^t (t stands for the transpose vector form)
  μ = (μ1, μ2, …, μd)^t is the mean vector
  Σ = d×d covariance matrix
  |Σ| and Σ^-1 are its determinant and inverse, respectively
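A sketch of this d-dimensional density using NumPy (assumed available); μ and Σ below are arbitrary illustrative values.

```python
# Sketch of the multivariate normal density in d dimensions.
import numpy as np

def multivariate_normal_pdf(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

mu = np.array([0.0, 1.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(multivariate_normal_pdf(np.array([0.5, 1.5]), mu, sigma))
```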
Discriminant Functions for the Normal Density
• We saw that the minimum error-rate classification can be achieved by the discriminant function
  gi(x) = ln P(x | ωi) + ln P(ωi)
• Case of multivariate normal:
  gi(x) = −(1/2) (x − μi)^t Σi^-1 (x − μi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
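A sketch of this discriminant for two hypothetical Gaussian classes; the means, covariances and priors are made-up values.

```python
# Sketch of the Gaussian discriminant
# g_i(x) = -0.5*(x-mu_i)^T Sigma_i^{-1} (x-mu_i) - (d/2)*ln(2*pi)
#          - 0.5*ln|Sigma_i| + ln P(w_i).
import numpy as np

def g(x, mu, sigma, prior):
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

classes = [
    (np.array([0.0, 0.0]), np.eye(2), 0.6),
    (np.array([2.0, 1.0]), np.array([[1.5, 0.2], [0.2, 0.8]]), 0.4),
]
x = np.array([1.0, 0.5])
scores = [g(x, mu, sigma, p) for mu, sigma, p in classes]
print("decide w%d" % (1 + int(np.argmax(scores))))
```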
• Case Σi = σ²·I (I stands for the identity matrix)
  gi(x) = wi^t x + wi0 (linear discriminant function)
  where:
  wi = μi / σ²;   wi0 = −(1/(2σ²)) μi^t μi + ln P(ωi)
  (wi0 is called the threshold for the i-th category!)
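For this Σi = σ²I case, the linear discriminant can be sketched as follows; σ², the means and the priors are illustrative assumptions.

```python
# Sketch of the linear discriminant for the case Sigma_i = sigma^2 * I:
# g_i(x) = w_i^T x + w_i0 with w_i = mu_i / sigma^2 and
# w_i0 = -mu_i^T mu_i / (2 * sigma^2) + ln P(w_i). Values are illustrative.
import numpy as np

sigma2 = 1.0
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
priors = [0.5, 0.5]

def g(x, mu, prior):
    w = mu / sigma2
    w0 = -(mu @ mu) / (2 * sigma2) + np.log(prior)
    return w @ x + w0

x = np.array([1.0, 0.2])
scores = [g(x, mu, p) for mu, p in zip(means, priors)]
print("decide w%d" % (1 + int(np.argmax(scores))))
```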
• A classifier that uses linear discriminant functions
is called “a linear machine”
• The decision surfaces for a linear machine are
pieces of hyperplanes defined by:
gi(x) = gj(x)
The hyperplane is always orthogonal to the line linking the means!
• The hyperplane separating Ri and Rj passes through the point
  x0 = (1/2)(μi + μj) − [σ² / ‖μi − μj‖²] · ln[P(ωi) / P(ωj)] · (μi − μj)
  and is always orthogonal to the line linking the means!
  If P(ωi) = P(ωj), then x0 = (1/2)(μi + μj)
• Case Σi = Σ (the covariance matrices of all classes are identical but arbitrary!)
• The hyperplane separating Ri and Rj passes through
  x0 = (1/2)(μi + μj) − [ ln(P(ωi) / P(ωj)) / ((μi − μj)^t Σ^-1 (μi − μj)) ] · (μi − μj)
  (the hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!)
• Case Σi = arbitrary
  • The covariance matrices are different for each category
  gi(x) = x^t Wi x + wi^t x + wi0
  where:
  Wi = −(1/2) Σi^-1
  wi = Σi^-1 μi
  wi0 = −(1/2) μi^t Σi^-1 μi − (1/2) ln |Σi| + ln P(ωi)
  (The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids)
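A sketch of the resulting quadratic discriminant; the per-class means, covariance matrices and priors are illustrative values.

```python
# Sketch of the quadratic discriminant for arbitrary Sigma_i:
# g_i(x) = x^T W_i x + w_i^T x + w_i0 with W_i = -0.5 * Sigma_i^{-1},
# w_i = Sigma_i^{-1} mu_i, and
# w_i0 = -0.5 * mu_i^T Sigma_i^{-1} mu_i - 0.5 * ln|Sigma_i| + ln P(w_i).
import numpy as np

def quadratic_g(x, mu, sigma, prior):
    inv = np.linalg.inv(sigma)
    W = -0.5 * inv
    w = inv @ mu
    w0 = -0.5 * mu @ inv @ mu - 0.5 * np.log(np.linalg.det(sigma)) + np.log(prior)
    return x @ W @ x + w @ x + w0

classes = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5),
]
x = np.array([1.2, 0.8])
print("decide w%d" % (1 + int(np.argmax([quadratic_g(x, *c) for c in classes]))))
```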
Exercise
Select the optimal decision where:
Ω = {ω1, ω2}
P(x | ω1) ~ N(2, 0.5) (Normal distribution)
P(x | ω2) ~ N(1.5, 0.2)
P(ω1) = 2/3
P(ω2) = 1/3
Parameter estimation
Pattern Classification, Chapter 3
• Data availability in a Bayesian framework
  • We could design an optimal classifier if we knew:
    • P(ωi) (priors)
    • P(x | ωi) (class-conditional densities)
  Unfortunately, we rarely have this complete information!
• Design a classifier from a training sample
  • No problem with prior estimation
  • Samples are often too small for class-conditional estimation (large dimension of feature space!)
• A priori information about the problem
  • E.g. assume normality of P(x | ωi):
    P(x | ωi) ~ N(μi, Σi)
    characterized by 2 parameters
• Estimation techniques
  • Maximum-Likelihood (ML) and Bayesian estimation
  • Results are nearly identical, but the approaches are different
• Parameters in ML estimation are fixed but
unknown!
• Best parameters are obtained by maximizing the
probability of obtaining the samples observed
• Bayesian methods view the parameters as
random variables having some known distribution
• In either approach, we use P(ωi | x) for our classification rule!
• Use the information provided by the training samples to estimate
  θ = (θ1, θ2, …, θc); each θi (i = 1, 2, …, c) is associated with each category
• Suppose that D contains n samples, x1, x2, …, xn:
  P(D | θ) = ∏_{k=1}^{n} P(xk | θ)
  (P(D | θ) is called the likelihood of θ w.r.t. the set of samples)
• The ML estimate of θ is, by definition, the value that maximizes P(D | θ)
  "It is the value of θ that best agrees with the actually observed training sample"
• Optimal estimation
  • Let θ = (θ1, θ2, …, θp)^t and let ∇θ be the gradient operator:
    ∇θ = (∂/∂θ1, ∂/∂θ2, …, ∂/∂θp)^t
  • We define l(θ) as the log-likelihood function:
    l(θ) = ln P(D | θ)
  • New problem statement: determine the θ that maximizes the log-likelihood
    θ̂ = arg max_θ l(θ)
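For the univariate Gaussian this maximization has a closed form: the ML estimates are the sample mean and the (biased, 1/n) sample variance. A small sketch with made-up data:

```python
# Sketch: for a univariate Gaussian, maximizing l(theta) = ln P(D | theta)
# gives the sample mean and the (biased) sample variance as ML estimates.
# The data below are made-up numbers.
samples = [1.9, 2.3, 2.1, 1.7, 2.5, 2.0]

n = len(samples)
mu_ml = sum(samples) / n
var_ml = sum((x - mu_ml) ** 2 for x in samples) / n   # ML uses 1/n, not 1/(n-1)
print("mu_ml =", mu_ml, "var_ml =", var_ml)
```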
Bayesian Estimation
• In MLE, θ was assumed to be fixed
• In BE, θ is a random variable
• The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
• Goal: compute P(ωi | x, D)
  Given the sample D, Bayes formula can be written:
  P(ωi | x, D) = P(x | ωi, D) · P(ωi | D) / ∑_{j=1}^{c} P(x | ωj, D) · P(ωj | D)
• Bayesian Parameter Estimation: Gaussian Case
  Goal: estimate θ using the a-posteriori density P(θ | D)
• The univariate case: P(μ | D)
  μ is the only unknown parameter
  P(x | μ) ~ N(μ, σ²)
  P(μ) ~ N(μ0, σ0²)
  (μ0 and σ0 are known!)
P(μ | D) = P(D | μ) · P(μ) / ∫ P(D | μ) · P(μ) dμ   (1)
         ∝ ∏_{k=1}^{n} P(xk | μ) · P(μ)
• Reproducing density:
  P(μ | D) ~ N(μn, σn²)   (2)
  Identifying (1) and (2) yields:
  μn = [n σ0² / (n σ0² + σ²)] · μ̂n + [σ² / (n σ0² + σ²)] · μ0
  and
  σn² = σ0² σ² / (n σ0² + σ²)
  (μ̂n denotes the sample mean of the n observations)
• The univariate case: P(x | D)
  • P(μ | D) has been computed
  • P(x | D) remains to be computed!
  P(x | D) = ∫ P(x | μ) · P(μ | D) dμ   is Gaussian
  It provides:
  P(x | D) ~ N(μn, σ² + σn²)
  (the desired class-conditional density P(x | Dj, ωj))
  Therefore, using P(x | Dj, ωj) together with P(ωj) and Bayes formula, we obtain the Bayesian classification rule:
  choose the ωj maximizing P(ωj | x, D), i.e. the ωj maximizing P(x | ωj, Dj) · P(ωj)
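A sketch of these update formulas on made-up data, with σ², μ0 and σ0² chosen arbitrarily:

```python
# Sketch of the Bayesian update for the mean of a univariate Gaussian with
# known variance sigma^2 and prior N(mu_0, sigma_0^2):
# mu_n      = (n*s0 / (n*s0 + s)) * sample_mean + (s / (n*s0 + s)) * mu_0
# sigma_n^2 = s0 * s / (n*s0 + s),  where s = sigma^2 and s0 = sigma_0^2,
# and P(x | D) ~ N(mu_n, sigma^2 + sigma_n^2). All values are illustrative.
samples = [1.9, 2.3, 2.1, 1.7, 2.5, 2.0]   # made-up observations
sigma2, mu0, sigma0_2 = 0.25, 0.0, 1.0     # assumed known variance and prior

n = len(samples)
sample_mean = sum(samples) / n
mu_n = (n * sigma0_2 * sample_mean + sigma2 * mu0) / (n * sigma0_2 + sigma2)
sigma_n2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)
print("P(mu | D) ~ N(%.3f, %.4f)" % (mu_n, sigma_n2))
print("P(x  | D) ~ N(%.3f, %.4f)" % (mu_n, sigma2 + sigma_n2))
```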
• Bayesian Parameter Estimation: General Theory
• The P(x | D) computation can be applied to any situation in which the unknown density can be parametrized; the basic assumptions are:
  • The form of P(x | θ) is assumed known, but the value of θ is not known exactly
  • Our knowledge about θ is assumed to be contained in a known prior density P(θ)
  • The rest of our knowledge is contained in a set D of n random variables x1, x2, …, xn that follow P(x)
The basic problem is:
"Compute the posterior density P(θ | D)", then "Derive P(x | D)"
Using Bayes formula, we have:
  P(θ | D) = P(D | θ) · P(θ) / ∫ P(D | θ) · P(θ) dθ,
and by the independence assumption:
  P(D | θ) = ∏_{k=1}^{n} P(xk | θ)
MLE vs. Bayes estimation
• If n→∞ they are equal!
• MLE
  • Simple and fast (convex optimisation vs. numerical integration)
• Bayes estimation
  • We can express our uncertainty by the prior P(θ)
Summary
• Bayes decision theory
  General framework for probabilistic decision making
• Bayes classifier
  Classification is a special case of decision making
  (αi: choose ωi)
• Zero-one loss function
  λ can be omitted (the rule reduces to the maximum posterior)
• Bayes classifier with zero-one loss with Normal Density
Summary
• Parameter estimation
  • General procedures for estimating the parameters of densities based on a sample
  (they can be applied beyond the Bayes classifier)
• Bayesian machine learning: the marriage of Bayesian Decision Theory and parameter estimation from a training sample