Download P(x | i )

Bayesian Decision Theory (Sections 2.1-2.2) • Decision problem posed in probabilistic terms • Bayesian Decision Theory–Continuous Features • All the relevant probability values are known Probability Density Jain CSE 802, Spring 2013 Course Outline MODEL INFORMATION COMPLETE Bayes Decision Theory Parametric Approach “Optimal” Rules Plug-in Rules INCOMPLETE Supervised Learning Nonparametric Approach Unsupervised Learning Parametric Approach Density Geometric Rules Estimation (K-NN, MLP) Mixture Resolving Nonparametric Approach Cluster Analysis (Hard, Fuzzy) Introduction • From sea bass vs. salmon example to “abstract” decision making problem • State of nature; a priori (prior) probability • State of nature (which type of fish will be observed next) is unpredictable, so it is a random variable • The catch of salmon and sea bass is equiprobable • P(1) = P(2) (uniform priors) • P(1) + P( 2) = 1 (exclusivity and exhaustivity) • Prior prob. reflects our prior knowledge about how likely we are to observe a sea bass or salmon; these probabilities may depend on time of the year or the fishing area! • Bayes decision rule with only the prior information • Decide 1 if P(1) > P(2), otherwise decide 2 • Error rate = Min {P(1) , P(2)} • Suppose now we have a measurement or feature • • on the state of nature - say the fish lightness value Use of the class-conditional probability density P(x | 1) and P(x | 2) describe the difference in lightness feature between populations of sea bass and salmon Amount of overlap between the densities determines the “goodness” of feature • Maximum likelihood decision rule • Assign input pattern x to class 1 if P(x | 1) > P(x | 2), otherwise 2 • How does the feature x influence our attitude (prior) concerning the true state of nature? • Bayes decision rule • Posteriori probability, likelihood, evidence • P(j , x) = P(j | x)p (x) = p(x | j) P (j) • Bayes formula P(j | x) = {p(x | j) . P (j)} / p(x) j2 where P ( x )   P ( x |  j )P (  j ) j 1 • Posterior = (Likelihood. Prior) / Evidence • Evidence P(x) can be viewed as a scale factor that • guarantees that the posterior probabilities sum to 1 P(x | j) is called the likelihood of j with respect to x; the category j for which P(x | j) is large is more likely to be the true category • • P(1 | x) is the probability of the state of nature being 1 given that feature value x has been observed Decision based on the posterior probabilities is called the Optimal Bayes Decision rule For a given observation (feature value) X: if P(1 | x) > P(2 | x) if P(1 | x) < P(2 | x) decide 1 decide 2 To justify the above rule, calculate the probability of error: P(error | x) = P(1 | x) if we decide 2 P(error | x) = P(2 | x) if we decide 1 • So, for a given x, we can minimize te rob. Of error, decide 1 if P(1 | x) > P(2 | x); otherwise decide 2 Therefore: P(error | x) = min [P(1 | x), P(2 | x)] • Thus, for each observation x, Bayes decision rule • minimizes the probability of error Unconditional error: P(error) obtained by integration over all x w.r.t. p(x) • Optimal Bayes decision rule Decide 1 if P(1 | x) > P(2 | x); otherwise decide 2 • Special cases: (i) P(1) = P(2); Decide 1 if p(x | 1) > p(x | 2), otherwise 2 (ii) p(x | 1) = p(x | 2); Decide 1 if P(1) > P(2), otherwise 2 Bayesian Decision Theory – Continuous Features • Generalization of the preceding formulation • Use of more than one feature (d features) • Use of more than two states of nature (c classes) • Allowing other actions besides deciding on the state of • nature Introduce a loss function which is more general than the probability of error • Allowing actions other than classification primarily allows the possibility of rejection • Refusing to make a decision when it is difficult to decide between two classes or in noisy cases! • The loss function specifies the cost of each action • Let {1, 2,…, c} be the set of c states of nature (or “categories”) • Let {1, 2,…, a} be the set of a possible actions • Let (i | j) be the loss incurred for taking action i when the true state of nature is j • General decision rule (x) specifies which action to take for every possible observation x j c Conditional Risk R(  i | x )    (  i |  j )P (  j | x ) j 1 For a given x, suppose we take the action i ; if the true state is j , we will incur the loss (i | j). P(j | x) is the prob. that the true state is j But, any one of the C states is possible for given x. Overall risk R = Expected value of R(i | x) w.r.t. p(x) Conditional risk Minimizing R Minimize R(i | x) for i = 1,…, a Select the action i for which R(i | x) is minimum The overall risk R is minimized and the resulting risk is called the Bayes risk; it is the best performance that can be achieved! • Two-category classification 1 : deciding 1 2 : deciding 2 ij = (i | j) loss incurred for deciding i when the true state of nature is j Conditional risk: R(1 | x) = 11P(1 | x) + 12P(2 | x) R(2 | x) = 21P(1 | x) + 22P(2 | x) Bayes decision rule is stated as: if R(1 | x) < R(2 | x) Take action 1: “decide 1” This results in the equivalent rule: decide 1 if: (21- 11) P(x | 1) P(1) > (12- 22) P(x | 2) P(2) and decide 2 otherwise Likelihood ratio: The preceding rule is equivalent to the following rule: P ( x |  1 ) 12   22 P (  2 ) if  . P ( x |  2 )  21  11 P (  1 ) then take action 1 (decide 1); otherwise take action 2 (decide 2) Note that the posteriori porbabilities are scaled by the loss differences. Interpretation of the Bayes decision rule: “If the likelihood ratio of class 1 and class 2 exceeds a threshold value (that is independent of the input pattern x), the optimal action is to decide 1” Maximum likelihood decision rule: the threshold value is 1; 0-1 loss function and equal class prior probability Bayesian Decision Theory (Sections 2.3-2.5) • Minimum Error Rate Classification • Classifiers, Discriminant Functions and Decision Surfaces • The Normal Density Minimum Error Rate Classification • Actions are decisions on classes If action i is taken and the true state of nature is j then: the decision is correct if i = j and in error if i  j • Seek a decision rule that minimizes the probability of error or the error rate • Zero-one (0-1) loss function: no loss for correct decision and a unit loss for any error 0 i  j  (  i , j )   1 i  j i , j  1 ,..., c The conditional risk can now be simplified as: j c R(  i | x )    (  i |  j )P (  j | x ) j 1   P(  j | x )  1  P(  i | x ) j 1 “The risk corresponding to the 0-1 loss function is the average probability of error” • Minimizing the risk requires maximizing the posterior probability P(i | x) since R(i | x) = 1 – P(i | x)) • For Minimum error rate • Decide i if P (i | x) > P(j | x) j  i • Decision boundaries and decision regions 12  22 P (  2 ) P( x | 1 ) Let .    then decide  1 if :   21  11 P (  1 ) P( x |  2 ) • If  is the 0-1 loss function then the threshold involves only the priors:  0 1  1 0    then    P(  2 )  a P( 1 ) 0 2  2 P(  2 )  then    if     b P( 1 ) 1 0 Classifiers, Discriminant Functions and Decision Surfaces • Many different ways to represent pattern classifiers; one of the most useful is in terms of discriminant functions • The multi-category case • Set of discriminant functions gi(x), i = 1,…,c • Classifier assigns a feature vector x to class i if: gi(x) > gj(x) j  i Network Representation of a Classifier • Bayes classifier can be represented in this way, but the choice of discriminant function is not unique • gi(x) = - R(i | x) (max. discriminant corresponds to min. risk!) • For the minimum error rate, we take gi(x) = P(i | x) (max. discrimination corresponds to max. posterior!) gi(x)  P(x | i) P(i) gi(x) = ln P(x | i) + ln P(i) (ln: natural logarithm!) • Effect of any decision rule is to divide the feature space into c decision regions if gi(x) > gj(x) j  i then x is in Ri (Region Ri means assign x to i) • The two-category case • Here a classifier is a “dichotomizer” that has two discriminant functions g1 and g2 Let g(x)  g1(x) – g2(x) Decide 1 if g(x) > 0 ; Otherwise decide 2 • So, a “dichotomizer” computes a single discriminant function g(x) and classifies x according to whether g(x) is positive or not. • Computation of g(x) = g1(x) – g2(x) g( x )  P (  1 | x )  P (  2 | x ) P( x | 1 ) P( 1 )  ln  ln P( x |  2 ) P(  2 ) The Normal Density • Univariate density: N( , 2) • Normal density is analytically tractable • Continuous density • A number of processes are asymptotically Gaussian • Patterns (e.g., handwritten characters, speech signals ) can be viewed as randomly corrupted versions of a single typical or prototype (Central Limit theorem) P( x )  2  1 1 x  exp     , 2   2     where:  = mean (or expected value) of x 2 = variance (or expected squared deviation) of x • Multivariate density: N( , ) • Multivariate normal density in d dimensions: P( x )  1 ( 2 ) d/2  1/ 2  1  t 1 exp  ( x   )  ( x   )  2  where: x = (x1, x2, …, xd)t (t stands for the transpose of a vector)  = (1, 2, …, d)t mean vector  = d*d covariance matrix • • • || and -1 are determinant and inverse of , respectively The covariance matrix is always symmetric and positive semidefinite; we assume  is positive definite so the determinant of  is strictly positive Multivariate normal density is completely specified by [d + d(d+1)/2] parameters If variables x1 and x2 are statistically independent then the covariance of x1 and x2 is zero. Multivariate Normal density Samples drawn from a normal population tend to fall in a single cloud or cluster; cluster center is determined by the mean vector and shape by the covariance matrix The loci of points of constant density are hyperellipsoids whose principal axes are the eigenvectors of  r 2  ( x   )t 1 ( x   ) Transformation of Normal Variables Linear combinations of jointly normally distributed random variables are normally distributed Coordinate transformation can convert an arbitrary multivariate normal distribution into a spherical one Bayesian Decision Theory (Sections 2-6 to 2-9) • Discriminant Functions for the Normal Density • Bayes Decision Theory – Discrete Features Discriminant Functions for the Normal Density • The minimum error-rate classification can be achieved by the discriminant function gi(x) = ln P(x | i) + ln P(i) • In case of multivariate normal densities 1 1 d 1 t g i ( x )   ( x   i )  ( x   i )  ln 2  ln  i  ln P (  i ) 2 2 2 i • Case i = 2.I (I is the identity matrix) Features are statistically independent and each feature has the same variance g i ( x )  w x  w i 0 (linear discriminant function) t i where : i 1 t wi  2 ; wi 0    i  i  ln P (  i ) 2  2 (  i 0 is called the threshold for the ith category! ) • A classifier that uses linear discriminant functions is called “a linear machine” • The decision surfaces for a linear machine are pieces of hyperplanes defined by the linear equations: gi(x) = gj(x) • The hyperplane separating Ri and Rj 1 2 x0  (  i   j )  2 i   j 2 P(  i ) ln ( i   j ) P(  j ) is orthogonal to the line linking the means! 1 if P (  i )  P (  j ) then x0  (  i   j ) 2 • Case 2: i =  (covariance matrices of all classes are identical but otherwise arbitrary!) • Hyperplane separating Ri and Rj   ln P (  i ) / P (  j ) 1 x0  (  i   j )  .(  i   j ) t 1 2 ( i   j )  ( i   j ) • The hyperplane separating Ri and Rj is generally not orthogonal to the line between the means! • To classify a feature vector x, measure the squared Mahalanobis distance from x to each of the c means; assign x to the category of the nearest mean Discriminant Functions for 1D Gaussian • Case 3: i = arbitrary • The covariance matrices are different for each category g i ( x )  x tWi x  w it x  w i 0 where : 1 1 Wi    i 2 w i   i 1  i 1 t 1 1 w i 0    i  i  i  ln  i  ln P (  i ) 2 2 In the 2-category case, the decision surfaces are hyperquadrics that can assume any of the general forms: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids) Discriminant Functions for the Normal Density Discriminant Functions for the Normal Density Discriminant Functions for the Normal Density Decision Regions for Two-Dimensional Gaussian Data x2  3.514  1.125x1  0.1875x12 Error Probabilities and Integrals • 2-class problem • There are two types of errors • Multi-class problem – Simpler to computer the prob. of being correct (more ways to be wrong than to be right) Error Probabilities and Integrals Bayes optimal decision boundary in 1-D case Error Bounds for Normal Densities • • The exact calculation of the error for the general Guassian case (case 3) is extremely difficult However, in the 2-category case the general error can be approximated analytically to give us an upper bound on the error Error Rate of Linear Discriminant Function (LDF) • Assume a 2-class problem p  x   ~ N (  , ), p  x   ~ N (  , ) 1 1 2 2 1 gi ( x)  log  P( x i )    ( x   i )t  1 ( x   i )  log  P(i )  2 • Due to the symmetry of the problem (identical ), the two types of errors are identical • Decide x  if g ( x)  g ( x) or 1 1 2 1 1 t 1  ( x   1 )  ( x   1 )  log  P(1 )    ( x   2 )t  1 ( x   2 )  log  P(2 )  2 2 or   1 t 1 t (  2   1 )  x   1  1   2  1  2  log  P(1 ) / P(2 )  2 t 1 Error Rate of LDF • Let h( x)  (    )  x  12          • Compute expected values & variances of when x  1 & x  2 t 2 1 1 t 1 1 t 1 1 2 1  E  h( x) x  1   (  2   1 )t  1 E  x 1   1   (  2   1 )t  1 (  2   1 ) 2   where 2  h( x) 1 t 1  1  1   t2 1  2 2  1 (  2  1 )t 1 (  2  1 ) = squared Mahalanobis distance between 1 &  2 Error Rate of LDF • Similarly 1 2 2   (  2  1 )t 1 (  2  1 )    12  E  h( x)  1  1   E (  2   1 )t  1 ( x   1 ) x  1  2    (  2   1 )t  1 (  2   1 )  2  22  2 p  h( x) x  1  ~ N ( , 2 ) p  h( x) x  2  ~ N ( , 2 ) Error Rate of LDF  1  P  g1 ( x)  g 2 ( x) x  1    P  h( x) 1  dh t    n t 2 1  2 1 e 2 d 2 1 1   erf 2 2  t     4  h( x ) ~ 1  () 1 e 2 2  2 Error Rate of LDF  P (1 )  t  log   P (  )  2  2 erf (r )   1 1  2   erf 2 2 r e  x2 dx 0  t   4    Total probability of error Pe  P (1 )1  P (2 ) 2 Error Rate of LDF 1 P 1   P 2    t 0 2    1 1 1 1 1   2   erf     erf 2 2  4  2 2  (  1   2 )t  1 (  1   2 )      2 2   (i) No Class Separation (  1   2 )t  1 (  1   2 )  0  1   2  1 2 (ii) Perfect Class Separation (  1   2 )t  1 (  1   2 )  0 ∞  1   2  0 (erf  1) Mahalanobis distance is a good measure of separation between classes Chernoff Bound • To derive a bound for the error, we need the following inequality Assume conditional prob. are normal where Chernoff Bound Chernoff bound for P(error) is found by determining the value of  that minimizes exp(-k()) Error Bounds for Normal Densities • Bhattacharyya Bound • Assume  = 1/2 • computationally simpler • slightly less tight bound • Now, Eq. (73) has the form When the two covariance matrices are equal, k(1/2) is te same as the Mahalanobis distance between the two means Error Bounds for Gaussian Distributions Chernoff Bound P(error )  P  (1 ) P1  (1 )  p  ( x | 1 ) p1  ( x | 2 )dx p  0   1 ( x | 1 ) p1  ( x | 2 )dx  e  k (  ) k ( )   (1  ) 2 1 t (  2  1 ) [ 1  (1   )  2 ] (  2  1 )  1 1 (1  ) 2 ln 2 |1| |2 |1 Best Chernoff error bound is 0.008190 Bhattacharya Bound (β=1/2) P(error )  P(1 ) P(2 )  P( x | 1 ) P( x | 2 )dx  P(1 ) P(2 )e  k (1/2) k (1 / 2)  1 / 8(  2  1 ) t 2  1   2  1 2–category, 2D data 1 2 1 2 (  2  1 )  ln 2 |1 ||2 | Bhattacharya error bound is 0.008191 True error using numerical integration = 0.0021 Neyman-Pearson Rule “Classification, Estimation and Pattern recognition” by Young and Calvert Neyman-Pearson Rule Neyman-Pearson Rule Neyman-Pearson Rule Neyman-Pearson Rule Neyman-Pearson Rule Signal Detection Theory We are interested in detecting a single weak pulse, e.g. radar reflection; the internal signal (x) in detector has mean m1 (m2) when pulse is absent (present) p( x | 1 ) ~ N ( 1 ,  2 ) p ( x | 2 ) ~ N (  2 ,  2 ) The detector uses a threshold x* to determine the presence of pulse Discriminability: ease of determining whether the pulse is present or not d' | 1  2 |  For given threshold, define hit, false alarm, miss and correct rejection P( x  x*| x  2 ) : P( x  x*| x  1 ) : P( x  x*| x  2 ) : P( x  x*| x  1 ) : hit false alarm miss correct rejection Receiver Operating Characteristic (ROC) • Experimentally compute hit and false alarm rates for • • fixed x* Changing x* will change the hit and false alarm rates A plot of hit and false alarm rates is called the ROC curve Performance shown at different operating points Operating Characteristic • In practice, distributions may not be Gaussian • and will be multidimensional; ROC curve can still be plotted Vary a single control parameter for the decision rule and plot the resulting hit and false alarm rates Bayes Decision Theory – Discrete Features • • Components of x are binary or integer valued; x can take only one of m discrete values v1, v2, …,vm Case of independent binary features for 2-category problem Let x = [x1, x2, …, xd ]t where each xi is either 0 or 1, with probabilities: pi = P(xi = 1 | 1) qi = P(xi = 1 | 2) • The discriminant function in this case is: d g ( x )   w i x i  w0 i 1 where : pi ( 1  q i ) w i  ln q i ( 1  pi ) i  1 ,..., d and : 1  pi P( 1 ) w0   ln  ln 1  qi P(  2 ) i 1 d decide  1 if g(x)  0 and  2 if g(x)  0 Bayesian Decision for Three-dimensional Binary Data • Consider a 2-class problem with three independent binary features; class priors are equal and pi = 0.8 and qi = 0.5, i = 1,2,3 • wi = 1.3863 • w0 = 1.2 • Decision surface g(x) = 0 is shown below Decision boundary for 3D binary features. Left figure shows the case when pi=.8 and qi=.5. Right figure shows case when p3=q3 (Feature 3 is not providing any discriminatory information) so decision surface is parallel to x3 axis Handling Missing Features • Suppose it is not possible to measure a certain feature for a given pattern • Possible solutions: • Reject the pattern • Approximate the missing feature • • Mean of all the available values for the missing feature Marginalize over the distribution of the missing feature Handling Missing Features Other Topics • Compound Bayes Decision Theory & Context – Consecutive states of nature might not be statistically independent; in sorting two types of fish, arrival of next fish may not be independent of the previous fish – Can we exploit such statistical dependence to gain improved performance (use of context) – Compound decision vs. sequential compound decision problems – Markov dependence • Sequential Decision Making – Feature measurement process is sequential (as in medical diagnosis) – Feature measurement cost – Minimize the no. of features to be measured while achieving a sufficient accuracy; minimize a combination of feature measurement cost & classification accuracy Context in Text Recognition

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download P(x | i )