Bayesian Decision Theory
Z. Ghassabi

Outline
• What is pattern recognition?
• What is classification?
• Need for probabilistic reasoning
• Probabilistic classification theory
• What is Bayesian decision theory? History, prior probabilities, class-conditional probabilities, Bayes formula, a casual formulation, decision, decision for two categories
• Classification by Bayesian classification: basic concepts, Bayes rule, more general forms of Bayes rule, discriminant functions, Bayesian belief networks

What is pattern recognition? A typical image-processing application:
• Sorting fish: incoming fish are sorted according to species using optical sensing (sea bass or salmon?).
• Problem analysis: set up a camera and take some sample images to extract features. Consider features such as length, lightness, width, number and shape of fins, position of the mouth, etc.

Pattern classification system: sensing, segmentation, feature extraction, classification.
• Preprocessing / segmentation: isolate the fish from one another and from the background.
• Feature extraction: reduce the data by measuring certain features.
• Classification: divide the feature space into decision regions.

Classification: initially use the length of the fish as a possible feature for discrimination. Length turns out to be a poor discriminator.

Feature selection: the length alone is a poor feature, so select the lightness as a possible feature. Lightness is a better feature than length because it reduces the misclassification error. Can we combine features in such a way that we improve performance? (Hint: correlation.)

Threshold decision boundary and cost relationship: move the decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are classified as salmon!).
This trade-off is the task of decision theory.

Feature vector: adopt the lightness and add the width of the fish, giving the feature vector xᵀ = [x1, x2] (lightness, width).

Width and lightness:
• Treat the features as a two-dimensional vector (an N-tuple).
• Create a scatter plot.
• Draw a straight line (decision boundary) separating the two classes.

Features: we might add other features that are not highly correlated with the ones we already have. Be sure not to reduce performance by adding "noisy features". Ideally, you might think the best decision boundary is the one that provides optimal performance on the training data.

Decision theory: can we do better than a linear classifier? Is a boundary that fits the training data perfectly a good decision boundary? What is wrong with such a decision surface? (Hint: generalization.)

Decision boundary choice: satisfaction with training-set performance is premature, because the central aim of designing a classifier is to correctly classify new (test) inputs. This is the issue of generalization.

Generalization and risk: why might a smoother decision surface be a better choice? (Hint: Occam's razor.) Pattern recognition investigates how to find such "optimal" decision surfaces and how to provide system designers with the tools to make intelligent trade-offs.

Need for probabilistic reasoning: most everyday reasoning is based on uncertain evidence and inferences. Classical logic, which only allows conclusions to be strictly true or strictly false, does not account for this uncertainty or for the need to weigh and combine conflicting evidence. Today's expert systems employ fairly ad hoc methods for reasoning under uncertainty and for combining evidence.
Probabilistic classification theory: in classification the classes overlap, so each pattern belongs to a class only with some probability. In practice, some overlap between classes and random variation within classes occur; hence perfect separation between classes cannot be achieved, and misclassification may occur. Bayesian decision theory allows a cost function to be introduced for the case in which an input vector that is a member of class A is mistakenly assigned (misclassified) to class B.

What is Bayesian decision theory?

History: Bayesian probability was named after Reverend Thomas Bayes (1702-1761). He proved a special case of what is currently known as Bayes' theorem. The term "Bayesian" came into use around the 1950s. (http://en.wikipedia.org/wiki/Bayesian_probability)

History (cont.): Pierre-Simon, Marquis de Laplace (1749-1827) independently proved a generalized version of Bayes' theorem. The ideas proposed above were not fully developed until later: the Bayesian belief network appeared at Stanford University around 1970 (Judea Pearl, 1988), and BBNs became popular in the 1990s.

History (cont.): current uses of Bayesian networks include Microsoft's printer troubleshooter, diagnosing diseases (Mycin), predicting oil and stock prices, controlling the space shuttle, and risk analysis of schedule and cost overruns.

Probabilistic decision theory: Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. It uses a probabilistic approach to help make decisions (e.g., classification) so as to minimize the risk (cost). Assume all relevant probability distributions are known (later we will learn how to estimate these from data).
Prior probabilities: the state of nature is prior information. Let ω denote the state of nature, modeled as a random variable; e.g., ω = ω1 is the event that the next fish is a sea bass (category ω1: sea bass; category ω2: salmon).
• A priori probabilities: P(ω1) is the probability of category 1 and P(ω2) the probability of category 2, with P(ω1) + P(ω2) = 1 (either ω1 or ω2 must occur).
• Decision rule: decide ω1 if P(ω1) > P(ω2); otherwise decide ω2. But we know there will be many mistakes.
(http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm)

Class-conditional probabilities: a decision rule with only prior information always produces the same result and ignores measurements. If P(ω1) >> P(ω2), we will be correct most of the time. Given a feature x (lightness), which is a continuous random variable, p(x|ωj) is the class-conditional probability density for x given the state of nature ωj: p(x|ω1) and p(x|ω2) describe the difference in lightness between the populations of sea bass and salmon. What do p(lightness | salmon) and p(lightness | sea bass) look like?

Bayes formula: how do we combine a priori and class-conditional probabilities to obtain the probability of a state of nature? Suppose we know both P(ωj) and p(x|ωj), and we can measure x. How does this influence our decision? The joint probability of finding a pattern that is in category ωj and that has feature value x is

p(ωj, x) = P(ωj|x) p(x) = p(x|ωj) P(ωj).

• Rearranging terms, we arrive at Bayes' formula.
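As a small numerical sketch of Bayes' formula (all values below are hypothetical, not taken from the fish data): each likelihood is scaled by its prior and normalized by the evidence.

```python
# Illustrative priors for two states of nature w1 and w2, and the
# class-conditional densities p(x | wj) evaluated at one observed x.
priors = {"w1": 2 / 3, "w2": 1 / 3}
likelihoods = {"w1": 0.10, "w2": 0.50}   # hypothetical p(x | wj) values

# Evidence: p(x) = sum_j p(x | wj) P(wj)
evidence = sum(likelihoods[w] * priors[w] for w in priors)

# Bayes' formula: P(wj | x) = p(x | wj) P(wj) / p(x)
posteriors = {w: likelihoods[w] * priors[w] / evidence for w in priors}

# Bayes decision: pick the state with the larger posterior.
decision = max(posteriors, key=posteriors.get)
```

Note how the large prior on w1 is overridden here by the much larger likelihood of w2, which is exactly the conversion of prior into posterior described above.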
A casual formulation (posterior probabilities):
• The prior probability reflects knowledge of the relative frequency of instances of a class.
• The likelihood is a measure of the probability that a measurement value occurs in class j.
• The evidence is a scaling term.
For two categories, Bayes' formula

P(ωj|x) = p(x|ωj) P(ωj) / p(x)

can be expressed in words as: posterior = (likelihood × prior) / evidence. By measuring x, we can convert the prior probability P(ωj) into a posterior probability P(ωj|x). The evidence can be viewed as a scale factor and is often ignored in optimization applications (e.g., speech recognition).

Bayes decision: choose ω1 if P(ω1|x) > P(ω2|x); otherwise choose ω2.

Bayes' theorem, a special case: P(A∩B) = P(B) P(A|B) and P(B∩A) = P(A) P(B|A). Since P(A∩B) = P(B∩A), we have P(B) P(A|B) = P(A) P(B|A), so

P(A|B) = P(A) P(B|A) / P(B).

Preliminaries and notation: Ω = {ω1, ω2, …, ωc} is the set of states of nature; P(ωi) is the prior probability; x is the feature vector; p(x|ωi) is the class-conditional density; P(ωi|x) is the posterior probability.

Decision: with P(ωi|x) = p(x|ωi) P(ωi) / p(x), the decision rule is ω(x) = arg maxᵢ P(ωi|x), i.e., decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i. The evidence p(x) is a scale factor that assures the conditional probabilities sum to 1: P(ω1|x) + P(ω2|x) = 1. We can eliminate the scale factor (which appears on both sides): decide ωi if p(x|ωi) P(ωi) > p(x|ωj) P(ωj) for all j ≠ i. Special cases:
1. P(ω1) = P(ω2) = … = P(ωc): the decision is based entirely on the likelihoods p(x|ωj).
2. p(x|ω1) = p(x|ω2) = … = p(x|ωc): x gives us no useful information, and the decision rests on the priors alone.

Two categories: decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2. Equivalently, decide ω1 if p(x|ω1) P(ω1) > p(x|ω2) P(ω2); otherwise decide ω2. Special cases:
1. P(ω1) = P(ω2): decide ω1 if p(x|ω1) > p(x|ω2); otherwise decide ω2.
2. p(x|ω1) = p(x|ω2): decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.
Example: with equal priors P(ω1) = P(ω2), the decision is based on the likelihoods alone; with equal likelihoods it is based on the priors alone. (Figure: decision regions R1 and R2 for P(ω1) = P(ω2).)

Example: P(ω1) = 2/3, P(ω2) = 1/3. Bayes decision rule: decide ω1 if p(x|ω1) P(ω1) > p(x|ω2) P(ω2); otherwise decide ω2. (Figure: the corresponding decision regions R1 and R2.)

Posterior probabilities: two-class fish-sorting problem with P(ω1) = 2/3, P(ω2) = 1/3. For every value of x, the posteriors sum to 1.0. At x = 14, the probability of category ω2 is 0.08, and for category ω1 it is 0.92.

Bayes decision rule and classification error. Decision rule: for an observation x, decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2. Probability of error:

P(error|x) = P(ω2|x) if we decide ω1, and P(error|x) = P(ω1|x) if we decide ω2.

The average probability of error is given by

P(error) = ∫ P(error, x) dx = ∫ P(error|x) p(x) dx.

For two categories, the rule gives P(error|x) = min[P(ω1|x), P(ω2|x)]. If for every x we ensure that P(error|x) is as small as possible, then the integral is as small as possible. Thus the Bayes decision rule minimizes P(error).

Continuous features, generalization of the two-class problem:
• Use more than one feature (e.g., length and lightness).
• Use more than two states of nature (N-way classification).
• Allow actions other than a decision on the state of nature (e.g., rejection: refusing to take an action when the alternatives are close or confidence is low).
• Introduce a loss function more general than the probability of error (errors need not be equally costly).
Let us replace the scalar x by the vector x in a d-dimensional Euclidean space R^d, called the feature space.

Loss function: let Ω = {ω1, ω2, …, ωc} be the set of c states of nature (categories) and A = {α1, α2, …, αa} the set of a possible actions. λij = λ(αi|ωj) is the loss incurred for taking action αi when the true state of nature is ωj; it can be zero. We want to minimize the expected loss in making a decision.
Risk examples.
Ex 1, fish classification: X is the image of a fish, with x = (brightness, length, fin count, etc.); ω is our belief about the fish type, with C = {"sea bass", "salmon", "trout", …}; α is a decision on the fish type, with A = {"sea bass", "salmon", "trout", "manual inspection needed", …}.
Ex 2, medical diagnosis: X is all the available medical tests and imaging scans that a doctor can order for a patient, with x = (blood pressure, glucose level, cough, x-ray, etc.); ω is an illness type, with C = {"flu", "cold", "TB", "pneumonia", "lung cancer", …}; α is a decision on treatment, with A = {"Tylenol", "hospitalize", "more tests needed", …}.

Conditional risk: with

P(ωj|x) = p(x|ωj) P(ωj) / p(x),  p(x) = Σ_{j=1}^{c} p(x|ωj) P(ωj),

the expected loss (risk) associated with taking action αi given x is

R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) P(ωj|x) = Σ_{j=1}^{c} λij P(ωj|x).

0/1 loss function: if λ(αi|ωj) = 0 when αi is the correct decision associated with ωj and 1 otherwise, the conditional risk reduces to the probability of error: R(αi|x) = P(error|x).

Decision: a general decision rule is a function α(x) that tells us which action to take for every possible observation. The Bayesian decision rule is

α(x) = arg minᵢ R(αi|x).

Overall risk: the overall risk is given by

R = ∫ R(α(x)|x) p(x) dx,

where α(x) is the decision function. Compute the conditional risk for every αi and select the action that minimizes R(αi|x). If we choose α(x) so that R(α(x)|x) is as small as possible for every x, the overall risk will be minimized. The resulting risk, denoted R*, is referred to as the Bayes risk: the best performance that can be achieved.
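The conditional-risk computation can be sketched directly from the definition; the loss matrix and posterior values below are illustrative, not taken from the slides.

```python
# Conditional risk R(a_i | x) = sum_j lambda(a_i | w_j) P(w_j | x),
# for a hypothetical two-state, two-action problem.
loss = [[0.0, 2.0],    # lambda(a1 | w1), lambda(a1 | w2)
        [1.0, 0.0]]    # lambda(a2 | w1), lambda(a2 | w2)
posterior = [0.7, 0.3]  # P(w1 | x), P(w2 | x) at some observed x

risk = [sum(loss[i][j] * posterior[j] for j in range(2)) for i in range(2)]

# Bayesian decision rule: take the action minimizing the conditional risk.
best_action = min(range(2), key=lambda i: risk[i])
```

Even though w1 is the more probable state, an asymmetric loss matrix can shift which action minimizes the risk; here the unequal off-diagonal losses leave a1 only slightly cheaper than a2.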
Bayesian decision rule: α(x) = arg minᵢ R(αi|x) is the optimal rule for minimizing the overall risk, and its resulting overall risk is called the Bayes risk.

Two-category classification: let α1 correspond to deciding ω1, α2 to deciding ω2, and λij = λ(αi|ωj). The states of nature are {ω1, ω2}, the actions {α1, α2}, and the losses λ11, λ12, λ21, λ22. The conditional risks are given by

R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x)
R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x).

Our decision rule is: choose ω1 if R(α1|x) < R(α2|x); otherwise decide ω2. Equivalently, perform α1 if R(α2|x) > R(α1|x), i.e., if

λ21 P(ω1|x) + λ22 P(ω2|x) > λ11 P(ω1|x) + λ12 P(ω2|x),

which rearranges to

(λ21 − λ11) P(ω1|x) > (λ12 − λ22) P(ω2|x).

Posterior probabilities are thus scaled before comparison: the loss incurred for making an error is ordinarily greater than that incurred for being correct, so the factors (λ21 − λ11) and (λ12 − λ22) are positive, and the ratio of these factors simply scales the posteriors.

By employing Bayes' formula P(ωi|x) = p(x|ωi) P(ωi) / p(x), the evidence p(x) appears on both sides and is irrelevant, so we can replace the posteriors by the prior probabilities and conditional densities:

(λ21 − λ11) p(x|ω1) P(ω1) > (λ12 − λ22) p(x|ω2) P(ω2),

i.e.,

p(x|ω1) / p(x|ω2) > [(λ12 − λ22) P(ω2)] / [(λ21 − λ11) P(ω1)].

Stated as: choose α1 if the likelihood ratio exceeds a threshold value that is independent of the observation x. (This result will be recalled later.)
Likelihood ratio: perform α1 if

p(x|ω1) / p(x|ω2) > [(λ12 − λ22) P(ω2)] / [(λ21 − λ11) P(ω1)],

where the right-hand side is the threshold.

Loss factors: with the zero-one loss matrix λ = [[0, 1], [1, 0]], i.e., identical loss factors, and equal prior probabilities, this reduces to a standard likelihood-ratio test: choose ω1 if p(x|ω1)/p(x|ω2) > 1.

Minimum error rate: the error rate (probability of error) is to be minimized. Consider a symmetric, zero-one loss function:

λ(αi|ωj) = 0 if i = j, 1 if i ≠ j, for i, j = 1, 2, …, c.

The conditional risk is

R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) P(ωj|x) = Σ_{j≠i} P(ωj|x) = 1 − P(ωi|x),

so the conditional risk is the average probability of error. To minimize the error, maximize P(ωi|x); this is also known as maximum a posteriori (MAP) decoding. Minimum-error-rate classification: choose ωi if P(ωi|x) > P(ωj|x) for all j ≠ i.

Example: for the sea bass population, the lightness x is a normal random variable distributed according to N(4, 1); for the salmon population, x is distributed according to N(10, 1). Select the optimal decision when:
a. the two fish are equiprobable;
b. P(sea bass) = 2 × P(salmon);
c. the cost of classifying a fish as salmon when it truly is sea bass is $2, and the cost of classifying a fish as sea bass when it is truly salmon is $1.

End of Section 1

How to define discriminant functions? Multicategory classification: how do we represent pattern classifiers? The most common way is through discriminant functions. With {ω1, ω2, …, ωc} the possible states of nature, for each class we create a discriminant function gi(x). Our classifier is a network or machine that computes c discriminant functions g1(x), …, gc(x) and takes action α(x) (e.g., classification): assign x to ωi if gi(x) > gj(x) for all j ≠ i. If f(·) is a monotonically increasing function, then the f(gi(·)) are also discriminant functions.
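The worked fish example above (p(x | sea bass) = N(4, 1), p(x | salmon) = N(10, 1)) can be checked with a minimal likelihood-ratio sketch. The function and parameter names are mine: lam_sb is the cost of calling a true sea bass "salmon" (λ21) and lam_sa the reverse (λ12), with zero loss for correct decisions.

```python
import math

def norm_pdf(x, mu, sigma=1.0):
    # Univariate normal density N(mu, sigma^2)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def decide(x, p_bass=0.5, lam_sb=1.0, lam_sa=1.0):
    # Decide "sea bass" if the likelihood ratio p(x|bass)/p(x|salmon)
    # exceeds the threshold (lam_sa * P(salmon)) / (lam_sb * P(bass)),
    # which is independent of the observation x.
    ratio = norm_pdf(x, 4.0) / norm_pdf(x, 10.0)
    threshold = (lam_sa * (1.0 - p_bass)) / (lam_sb * p_bass)
    return "sea bass" if ratio > threshold else "salmon"
```

With equal priors and equal costs (case a) the boundary sits at the midpoint x = 7; raising P(sea bass) to 2/3 (case b) or doubling the cost of missing a sea bass (case c) both move the boundary toward the salmon mean.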
Simple discriminant functions: notice that the decision is the same if we change every gi(x) to f(gi(x)), assuming f(·) is a monotonically increasing function. Minimum-risk case: gi(x) = −R(αi|x). Minimum-error-rate case: gi(x) = P(ωi|x), or equivalently gi(x) = p(x|ωi) P(ωi) or gi(x) = ln p(x|ωi) + ln P(ωi). (Figure 2.5.)

Decision regions: the net effect is to divide the feature space into c regions, one for each class. We then have c decision regions separated by decision boundaries:

Ri = {x | gi(x) > gj(x) for all j ≠ i}.

(Two-category example; Figure 2.6.)

The Normal Distribution

Basics of probability. Discrete random variable X (assume integer values): probability mass function (pmf) p(x) = P(X = x); cumulative distribution function (cdf) F(x) = P(X ≤ x) = Σ_{t ≤ x} p(t). Continuous random variable X: probability density function (pdf) p(x) or f(x), which is not itself a probability; cdf F(x) = P(X ≤ x) = ∫_{−∞}^{x} p(t) dt.

Probability mass function: all the values of this function must be non-negative and sum up to 1.

Probability density function: probabilities are obtained by integrating the density f(x) over an interval of the input variable x. For example, the probability of the variable X being within the interval [4.3, 7.8] is

P(4.3 ≤ X ≤ 7.8) = ∫_{4.3}^{7.8} f(x) dx.

Expectations: let g be a function of the random variable X. Then

E[g(X)] = Σ_x g(x) p(x) (X discrete),  E[g(X)] = ∫ g(x) p(x) dx (X continuous).

The kth moment is E[X^k]; the first moment is the mean μ_X = E[X]; the kth central moment is E[(X − μ_X)^k].

Important expectations. Mean: μ_X = E[X] = Σ_x x p(x) (discrete) or ∫ x p(x) dx (continuous). Variance: σ_X² = Var[X] = E[(X − μ_X)²] = Σ_x (x − μ_X)² p(x) (discrete) or ∫ (x − μ_X)² p(x) dx (continuous). Fact: Var[X] = E[X²] − (E[X])².

Entropy: the entropy measures the fundamental uncertainty in the value of points selected randomly from a distribution.
H[X] = −Σ_x p(x) ln p(x) (X discrete),  H[X] = −∫ p(x) ln p(x) dx (X continuous).

Univariate Gaussian distribution X ~ N(μ, σ²):

p(x) = [1 / (√(2π) σ)] exp(−(x − μ)² / (2σ²)),

with E[X] = μ and Var[X] = σ². Properties: (1) it maximizes the entropy (among distributions with a given mean and variance); (2) the central limit theorem.

Illustration of the central limit theorem: let x1, x2, …, xn be a sequence of n independent and identically distributed random variables, each having finite expectation µ and variance σ² > 0. The central limit theorem states that as the sample size n increases, the distribution of the sample average of these random variables approaches the normal distribution with mean µ and variance σ²/n, irrespective of the shape of the original distribution. (Figure: the original probability density function and the densities of the sum of two, three, and four terms.)

Random vectors: a d-dimensional random vector X = (X1, X2, …, Xd)ᵀ, X: Ω → R^d. Vector mean: μ = E[X] = (μ1, μ2, …, μd)ᵀ. Covariance matrix:

Σ = E[(X − μ)(X − μ)ᵀ] =
[ σ1²  σ12  …  σ1d ]
[ σ21  σ2²  …  σ2d ]
[  ⋮    ⋮    ⋱   ⋮  ]
[ σd1  σd2  …  σd² ]

Multivariate Gaussian distribution X ~ N(μ, Σ) for a d-dimensional random vector:

p(x) = [1 / ((2π)^{d/2} |Σ|^{1/2})] exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)),

with E[X] = μ and E[(X − μ)(X − μ)ᵀ] = Σ.

Properties of N(μ, Σ): let Y = AᵀX, where A is a d × k matrix; then Y ~ N(Aᵀμ, AᵀΣA).

On the parameters of N(μ, Σ): μi = E[Xi]; Σ = [σij]_{d×d} with σij = E[(Xi − μi)(Xj − μj)] = Cov(Xi, Xj) and σii = σi² = E[(Xi − μi)²] = Var(Xi); if Xi and Xj are independent, then σij = 0.

More on the covariance matrix: Σ is symmetric and positive semidefinite, with eigendecomposition

Σ = ΦΛΦᵀ = (ΦΛ^{1/2})(ΦΛ^{1/2})ᵀ,

where Φ is an orthonormal matrix whose columns are eigenvectors of Σ, and Λ is the diagonal matrix of eigenvalues.
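The central limit theorem illustrated above can also be demonstrated empirically; this sketch averages uniform(0, 1) draws (mean 1/2, variance 1/12), with the sample size and trial count chosen arbitrarily.

```python
import random
import statistics

# Averages of n i.i.d. uniform(0, 1) variables should concentrate around
# 1/2 with variance (1/12)/n, regardless of the original (non-normal) shape.
random.seed(0)
n, trials = 48, 4000
averages = [statistics.fmean(random.random() for _ in range(n))
            for _ in range(trials)]

sample_mean = statistics.fmean(averages)
sample_var = statistics.variance(averages)
```

The empirical mean and variance of the averages match the CLT prediction (µ, σ²/n) to within sampling error; a histogram of `averages` would already look bell-shaped at n = 48.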
Whitening transform: for X ~ N(μ, Σ) with Σ = (ΦΛ^{1/2})(ΦΛ^{1/2})ᵀ, let A_w = ΦΛ^{−1/2} and Y = A_wᵀ X. Then

A_wᵀ Σ A_w = (ΦΛ^{−1/2})ᵀ (ΦΛ^{1/2})(ΦΛ^{1/2})ᵀ (ΦΛ^{−1/2}) = I,

so A_wᵀ X ~ N(A_wᵀ μ, I): the eigendecomposition supplies a linear projection that whitens the data.
• The whitening transformation is a decorrelation method that converts the covariance matrix Σ of a set of samples into the identity matrix I.
• This effectively creates new random variables that are uncorrelated and have unit variance.
• The method is called the whitening transform because it transforms the input closer toward white noise. It can be expressed as A_w = ΦΛ^{−1/2}, where Φ is the matrix with the eigenvectors of Σ as its columns and Λ is the diagonal matrix of non-increasing eigenvalues.

White noise: white noise is a random signal (or process) with a flat power spectral density; in other words, the signal contains equal power within a fixed bandwidth at any center frequency. (Figure: energy spectral density.)

Mahalanobis distance: for X ~ N(μ, Σ),

p(x) = [1 / ((2π)^{d/2} |Σ|^{1/2})] exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)),

the first factor is constant, and the density depends on x only through the squared Mahalanobis distance r² = (x − μ)ᵀ Σ⁻¹ (x − μ).

In statistics, the Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936. It is based on correlations between variables, by which different patterns can be identified and analyzed, and it is a useful way of determining the similarity of an unknown sample set to a known one. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant, i.e., not dependent on the scale of measurements.
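A minimal sketch of the whitening transform A_w = ΦΛ^(−1/2) described above, using an illustrative 2-D covariance matrix (the numbers are hypothetical):

```python
import numpy as np

# Hypothetical covariance of correlated features.
cov = np.array([[4.0, 1.5],
                [1.5, 1.0]])

eigvals, eigvecs = np.linalg.eigh(cov)        # cov = Phi Lambda Phi^T
A_w = eigvecs @ np.diag(eigvals ** -0.5)      # A_w = Phi Lambda^(-1/2)

# A_w^T cov A_w should be the identity: the transformed variables are
# uncorrelated with unit variance ("white").
whitened = A_w.T @ cov @ A_w
```

The eigendecomposition first rotates into the eigenvector basis (decorrelation) and then rescales each axis by 1/√λᵢ (unit variance), which is exactly the two-step picture on the slide.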
Formally, the Mahalanobis distance from a vector x to a group of values with mean μ and covariance matrix Σ is defined as

D_M(x) = √((x − μ)ᵀ Σ⁻¹ (x − μ)).

The Mahalanobis distance can also be defined as a dissimilarity measure between two random vectors x and y of the same distribution with covariance matrix Σ:

d(x, y) = √((x − y)ᵀ Σ⁻¹ (x − y)).

If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, the resulting distance measure is called the normalized Euclidean distance:

d(x, y) = √( Σᵢ (xi − yi)² / σi² ),

where σi is the standard deviation of xi over the sample set.

Discriminant Functions for the Normal Populations

Normal density: if the features are statistically independent and the variance is the same for all features, the discriminant function is simple and linear in nature. A classifier that uses linear discriminant functions is called a linear machine. The decision surfaces are pieces of hyperplanes defined by linear equations.
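The squared Mahalanobis distance r², which reappears inside the normal-population discriminants, can be sketched with illustrative numbers; with a diagonal covariance it reduces to the normalized Euclidean distance defined above.

```python
import numpy as np

# r^2 = (x - mu)^T Sigma^{-1} (x - mu), hypothetical values.
mu = np.array([0.0, 0.0])
cov = np.array([[4.0, 0.0],
                [0.0, 1.0]])     # diagonal covariance: sigma_1 = 2, sigma_2 = 1
x = np.array([2.0, 1.0])

d = x - mu
r2 = float(d @ np.linalg.inv(cov) @ d)   # = (2/2)^2 + (1/1)^2 = 2

# With Sigma = I the same formula gives the plain squared Euclidean distance.
r2_euclid = float(d @ np.linalg.inv(np.eye(2)) @ d)
```

Note how the first coordinate's deviation of 2 counts only as one standard deviation, while the Euclidean version treats both axes alike.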
Minimum-error-rate classification: using gi(x) = ln p(x|ωi) + ln P(ωi) and assuming the measurements are normally distributed, Xi ~ N(μi, Σi):

p(x|ωi) = [1 / ((2π)^{d/2} |Σi|^{1/2})] exp(−½ (x − μi)ᵀ Σi⁻¹ (x − μi))

gi(x) = −½ (x − μi)ᵀ Σi⁻¹ (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(ωi).

Some algebra to simplify the discriminants: taking the natural logarithm of the density,

ln p(x|ωi) = ln { exp[−½ (x − μi)ᵀ Σi⁻¹ (x − μi)] / ((2π)^{d/2} |Σi|^{1/2}) }
  = −½ (x − μi)ᵀ Σi⁻¹ (x − μi) − ln[(2π)^{d/2}] − ln[|Σi|^{1/2}]
  = −½ (x − μi)ᵀ Σi⁻¹ (x − μi) − (d/2) ln 2π − ½ ln |Σi|.

Three cases:
Case 1: Σi = σ²I. Classes are centered at different means, and their feature components are pairwise independent and have the same variance.
Case 2: Σi = Σ. Classes are centered at different means but have the same covariance.
Case 3: Σi ≠ Σj, arbitrary.

Case 1 (Σi = σ²I): Σi = σ²I is independent of i, |Σi| = σ^{2d} (a constant), and the term (d/2) ln 2π is also independent of i, so these may be omitted and gi(x) reduces to

gi(x) = −||x − μi||² / (2σ²) + ln P(ωi)
  = −(xᵀx − 2μiᵀx + μiᵀμi) / (2σ²) + ln P(ωi).

But xᵀx is independent of i and may be omitted (irrelevant), so

gi(x) = (1/σ²) μiᵀ x − (1/(2σ²)) μiᵀ μi + ln P(ωi),

which is linear in x: gi(x) = wiᵀ x + wi0 with wi = μi/σ² and wi0 = −μiᵀμi/(2σ²) + ln P(ωi).
Case 1 continued: the boundary between ωi and ωj is where gi(x) = gj(x), i.e.,

(wi − wj)ᵀ x + (wi0 − wj0) = 0,

that is,

(μi − μj)ᵀ x − ½ (μiᵀμi − μjᵀμj) + σ² ln[P(ωi)/P(ωj)] = 0.

This can be written as wᵀ(x − x0) = 0 with

w = μi − μj,
x0 = ½ (μi + μj) − [σ² / ||μi − μj||²] ln[P(ωi)/P(ωj)] (μi − μj).

The decision boundary is a hyperplane perpendicular to the line between the means, passing through x0; x0 is the midpoint of the means when P(ωi) = P(ωj) (the log term is 0). For two classes:

x0 = ½ (μ1 + μ2) − [σ² / ||μ1 − μ2||²] ln[P(ω1)/P(ω2)] (μ1 − μ2).

When the priors are equal and the support regions are spherical, the decision boundary lies simply halfway between the means (Euclidean distance): a minimum-distance classifier (template matching). Note how unequal priors shift the boundary away from the more likely mean! (Figures: boundaries for P(ω1) = P(ω2) and for P(ω1) ≠ P(ω2).)

Case 2 (Σi = Σ):
• Covariance matrices are arbitrary, but equal to each other for all classes.
• Features then form hyper-ellipsoidal clusters of equal size and shape.
Dropping the terms independent of i,

gi(x) = −½ (x − μi)ᵀ Σ⁻¹ (x − μi) + ln P(ωi),

where the quadratic term is the squared Mahalanobis distance and ln P(ωi) is irrelevant if the priors are equal. Expanding,

gi(x) = −½ (xᵀΣ⁻¹x − 2μiᵀΣ⁻¹x + μiᵀΣ⁻¹μi) + ln P(ωi),

and xᵀΣ⁻¹x is independent of i (irrelevant), so again gi(x) = wiᵀ x + wi0.
Case 2 continued: gi(x) = wiᵀ x + wi0 with

wi = Σ⁻¹ μi,  wi0 = −½ μiᵀ Σ⁻¹ μi + ln P(ωi),

which is again linear in x. From gi(x) = gj(x), the boundary between ωi and ωj is wᵀ(x − x0) = 0 with

w = Σ⁻¹ (μi − μj),
x0 = ½ (μi + μj) − { ln[P(ωi)/P(ωj)] / [(μi − μj)ᵀ Σ⁻¹ (μi − μj)] } (μi − μj).

The discriminant hyperplanes are often not orthogonal to the segments joining the class means. (Figures.)

Case 3 (Σi ≠ Σj): the covariance matrices are different for each category, and the quadratic term can no longer be dropped:

gi(x) = −½ (x − μi)ᵀ Σi⁻¹ (x − μi) − ½ ln |Σi| + ln P(ωi),

which can be written as

gi(x) = xᵀ Wi x + wiᵀ x + wi0,

with

Wi = −½ Σi⁻¹,
wi = Σi⁻¹ μi,
wi0 = −½ μiᵀ Σi⁻¹ μi − ½ ln |Σi| + ln P(ωi).

In the two-class case the decision boundaries form hyperquadrics, e.g., hyperplanes, hyperspheres, hyperellipsoids, hyperhyperboloids (in cases 1 and 2 the xᵀWix term is absent). Non-simply connected decision regions can arise even in one dimension for Gaussians having unequal variance. (Figures.)
Multi-category classification. (Figure.)

Example, a problem. Exemplars (transposed):
For ω1: {(2, 6), (3, 4), (3, 8), (4, 6)}
For ω2: {(1, −2), (3, 0), (3, −4), (5, −2)}
Calculated means (transposed): m1 = (3, 6), m2 = (3, −2). (Scatter plot: class 1, class 2, m1, m2.)

Covariance matrices: Σ1 = ¼ Σ_{i=1}^{4} (ai − m1)(ai − m1)ᵀ. The deviations for class 1 are (−1, 0), (0, −2), (0, 2), (1, 0), so

Σ1 = ¼ [ [2, 0], [0, 8] ] = [ [1/2, 0], [0, 2] ].

The deviations for class 2 are (−2, 0), (0, 2), (0, −2), (2, 0), so

Σ2 = ¼ [ [8, 0], [0, 8] ] = [ [2, 0], [0, 2] ].

Inverses and determinants for each of the covariance matrices:

Σ1⁻¹ = [ [2, 0], [0, 1/2] ], |Σ1| = 1;  Σ2⁻¹ = [ [1/2, 0], [0, 1/2] ], |Σ2| = 4.

A discriminant function for class 1: since

gi(x) = −½ (x − μi)ᵀ Σi⁻¹ (x − μi) − ½ ln |Σi| + ln P(ωi),

g1(x) = −½ [x1 − 3, x2 − 6] [ [2, 0], [0, 1/2] ] [x1 − 3, x2 − 6]ᵀ − ½ ln 1 + ln P(ω1)
  = −(x1 − 3)² − ¼ (x2 − 6)² + ln P(ω1)
  = −x1² + 6x1 − x2²/4 + 3x2 − 18 + ln P(ω1).

A discriminant function for class 2:

g2(x) = −½ [x1 − 3, x2 + 2] [ [1/2, 0], [0, 1/2] ] [x1 − 3, x2 + 2]ᵀ − ½ ln 4 + ln P(ω2)
  = −¼ [(x1 − 3)² + (x2 + 2)²] − ln 2 + ln P(ω2)
  = −x1²/4 + (3/2)x1 − x2²/4 − x2 − 13/4 − ln 2 + ln P(ω2).

The class boundary: assuming P(ω1) = P(ω2) and setting g1(x) = g2(x),

−x1² + 6x1 − x2²/4 + 3x2 − 18 = −x1²/4 + (3/2)x1 − x2²/4 − x2 − 13/4 − ln 2,

which simplifies to

4x2 = (3/4)x1² − (9/2)x1 + 18 − 13/4 − ln 2.
A quadratic separator: dividing by 4,

x2 = (3/16)x1² − (9/8)x1 + (18 − 13/4 − ln 2)/4
x2 ≈ 0.1875 x1² − 1.125 x1 + 3.514 (a quadratic!).

Plot of the discriminant: (figure: the scatter plot of class 1 and class 2 with m1, m2 and the quadratic discriminant curve).

Summary, steps for building a Bayesian classifier:
• Collect class exemplars.
• Estimate class a priori probabilities.
• Estimate class means.
• Form covariance matrices; find the inverse and determinant for each.
• Form the discriminant function for each class.

Using the classifier: obtain a measurement vector x; evaluate the discriminant function gi(x) for each class i = 1, …, c; decide x is in class j if gj(x) > gi(x) for all i ≠ j.

Criteria

Minimum error rate criterion: if action αi is taken and the true state is ωj, the decision is correct if i = j and in error if i ≠ j. The error rate (the probability of error) is to be minimized. With the symmetric, zero-one loss function

λ(αi|ωj) = 0 if i = j, 1 if i ≠ j, for i, j = 1, …, c,

the conditional risk is

R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) P(ωj|x) = 1 − P(ωi|x).

Minimax criterion: recall the two-category Bayes decision rule: decide ω1 if the likelihood ratio p(x|ω1)/p(x|ω2) exceeds the threshold [(λ12 − λ22) P(ω2)] / [(λ21 − λ11) P(ω1)]. The minimax criterion deals with the case in which the prior probabilities are unknown.

Basic concept of minimax: consider the worst-case prior probabilities (the maximum loss) and then pick the decision rule that minimizes the overall risk under them; that is, minimize the maximum possible overall risk,
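The worked example can be verified end to end; this sketch recomputes the means and maximum-likelihood covariances from the exemplars and checks that a point on the quoted boundary x2 = 0.1875x1² − 1.125x1 + 3.514 makes the two equal-prior discriminants (shared constant terms dropped) nearly equal.

```python
import numpy as np

# Exemplars from the worked example.
c1 = np.array([[2, 6], [3, 4], [3, 8], [4, 6]], dtype=float)
c2 = np.array([[1, -2], [3, 0], [3, -4], [5, -2]], dtype=float)

m1, m2 = c1.mean(axis=0), c2.mean(axis=0)
S1 = (c1 - m1).T @ (c1 - m1) / len(c1)   # ML covariance (divide by N)
S2 = (c2 - m2).T @ (c2 - m2) / len(c2)

def g(x, m, S):
    # Discriminant with equal priors; the ln P(w) and (d/2) ln 2*pi terms
    # are common to both classes and omitted.
    d = x - m
    return -0.5 * d @ np.linalg.inv(S) @ d - 0.5 * np.log(np.linalg.det(S))

x1 = 2.0
x2 = 0.1875 * x1**2 - 1.125 * x1 + 3.514   # point on the quoted boundary
p = np.array([x1, x2])
gap = g(p, m1, S1) - g(p, m2, S2)          # should be close to zero
```

The residual gap is nonzero only because the constant 3.514 is rounded; using (18 − 13/4 − ln 2)/4 exactly would drive it to machine precision.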
so that the worst risk for any value of the priors is as small as possible.

Overall risk: with R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x) and R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x),

R = ∫ R(α(x)|x) p(x) dx = ∫_{R1} R(α1|x) p(x) dx + ∫_{R2} R(α2|x) p(x) dx
  = ∫_{R1} [λ11 P(ω1|x) + λ12 P(ω2|x)] p(x) dx + ∫_{R2} [λ21 P(ω1|x) + λ22 P(ω2|x)] p(x) dx.

Using P(ωi|x) p(x) = p(x|ωi) P(ωi),

R = ∫_{R1} [λ11 P(ω1) p(x|ω1) + λ12 P(ω2) p(x|ω2)] dx + ∫_{R2} [λ21 P(ω1) p(x|ω1) + λ22 P(ω2) p(x|ω2)] dx.

Substituting P(ω2) = 1 − P(ω1) and using ∫_{R1} p(x|ωi) dx + ∫_{R2} p(x|ωi) dx = 1, the overall risk as a function of P(ω1) is

R[P(ω1)] = λ22 + (λ12 − λ22) ∫_{R1} p(x|ω2) dx
  + P(ω1) { (λ11 − λ22) + (λ21 − λ11) ∫_{R2} p(x|ω1) dx − (λ12 − λ22) ∫_{R1} p(x|ω2) dx }.

For a fixed decision boundary this has the form R = a·P(ω1) + b, where both the slope a and the intercept b depend on the setting of the decision boundary: this is the overall risk for a particular P(ω1). For the minimax solution we choose the boundary so that the coefficient of P(ω1) is zero; the risk is then independent of the value of P(ωi), and R = Rmm, the minimax risk.
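A numerical sketch of the minimax idea under 0/1 loss, reusing the hypothetical class densities N(4, 1) and N(10, 1) from the earlier fish example: the boundary x* is chosen by bisection so that the two conditional error masses are equal, which zeroes the coefficient of P(ω1) in the risk.

```python
import math

def Phi(z):
    # Standard normal CDF.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def error_masses(x_star):
    # Decide w1 for x < x*; p(x|w1) = N(4, 1), p(x|w2) = N(10, 1).
    miss_w1 = 1.0 - Phi(x_star - 4.0)    # P(x > x* | w1)
    miss_w2 = Phi(x_star - 10.0)         # P(x < x* | w2)
    return miss_w1, miss_w2

# Bisection on the equalizing condition; by symmetry x* should land at 7.
lo, hi = 4.0, 10.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    m1, m2 = error_masses(mid)
    lo, hi = (mid, hi) if m1 > m2 else (lo, mid)
x_star = 0.5 * (lo + hi)
```

At the equalizing boundary the risk line R = a·P(ω1) + b is flat (a = 0), so no adversarial choice of priors can make the error worse than this value.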
Minimax Risk
Setting the bracketed slope term to zero gives the minimax risk

    R_{mm} = \lambda_{22} + (\lambda_{12} - \lambda_{22}) \int_{R_1} p(\mathbf{x} \mid \omega_2) \, d\mathbf{x}
           = \lambda_{11} + (\lambda_{21} - \lambda_{11}) \int_{R_2} p(\mathbf{x} \mid \omega_1) \, d\mathbf{x}

Error Probability (0/1 Loss)
With the zero-one loss function (\lambda_{11} = \lambda_{22} = 0, \lambda_{12} = \lambda_{21} = 1), the overall risk reduces to the error probability

    P_{error}[P(\omega_1)] = \int_{R_1} p(\mathbf{x} \mid \omega_2) \, d\mathbf{x}
        + P(\omega_1) \Big[ \int_{R_2} p(\mathbf{x} \mid \omega_1) \, d\mathbf{x} - \int_{R_1} p(\mathbf{x} \mid \omega_2) \, d\mathbf{x} \Big]

Minimax Error Probability
The minimax condition equates the two conditional error probabilities, P(1|2) = P(2|1):

    P_{mm}(error) = \int_{R_2} p(\mathbf{x} \mid \omega_1) \, d\mathbf{x} = \int_{R_1} p(\mathbf{x} \mid \omega_2) \, d\mathbf{x}

[Figure: the class-conditional densities p(x | \omega_1) and p(x | \omega_2) with decision regions R_1 and R_2 chosen so that the two shaded error areas are equal.]

Neyman-Pearson Criterion

Bayesian Decision Rule: Two-Category Classification
Decide \omega_1 if the likelihood ratio exceeds the threshold:

    \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > \frac{(\lambda_{12} - \lambda_{22}) \, P(\omega_2)}{(\lambda_{21} - \lambda_{11}) \, P(\omega_1)}

The Neyman-Pearson criterion deals with the case in which both the loss functions and the prior probabilities are unknown.

Signal Detection Theory
Signal detection theory evolved from the development of communications and radar equipment during the first half of the last century. It migrated to psychology, initially as part of sensation-and-perception research, in the 1950s and 1960s, as an attempt to understand those features of human behavior in detecting very faint stimuli that traditional threshold theories could not explain.

The Situation of Interest
A person is faced with a stimulus (signal) that is very faint or confusing, and must decide: is the signal there or not? What makes this situation difficult is the presence of other activity that resembles the signal. Let us call this activity noise.

Example
Noise is present both in the environment and in the sensory system of the observer.
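For the 0/1-loss case, the minimax boundary can be found numerically by equating the two error integrals. A minimal sketch, assuming unit-variance Gaussian class densities; the means and the bisection search are illustrative choices, not from the slides:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu1, mu2, sigma = 0.0, 3.0, 1.0      # assumed p(x|w1)=N(0,1), p(x|w2)=N(3,1)

def p21(t):
    """P(2|1) = int_{R2} p(x|w1) dx, with R2 = (t, inf)."""
    return 1 - Phi((t - mu1) / sigma)

def p12(t):
    """P(1|2) = int_{R1} p(x|w2) dx, with R1 = (-inf, t)."""
    return Phi((t - mu2) / sigma)

# Bisection on the minimax condition P(2|1) = P(1|2): p21 falls and p12
# rises as the boundary t moves right, so the crossing is unique.
lo, hi = mu1, mu2
for _ in range(60):
    t = (lo + hi) / 2
    if p21(t) > p12(t):
        lo = t
    else:
        hi = t

def error_probability(prior1):
    """P_error as a function of the prior, at the minimax boundary t."""
    return p12(t) + prior1 * (p21(t) - p12(t))
```

With equal variances the condition is met midway between the means (t = 1.5 here), and the error probability no longer depends on the prior: error_probability(0.2) and error_probability(0.8) agree to numerical precision.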
The observer reacts to the momentary total activation of the sensory system, which fluctuates from moment to moment, as well as to environmental stimuli, which may include a signal.

Signal Detection Theory
Suppose we want to detect a single pulse in a signal that carries random noise. When the signal is present, we observe a normal distribution with mean u2; when it is absent, a normal distribution with mean u1. We assume the two distributions share the same standard deviation. Can we measure the discriminability of the problem, independently of the threshold x*?

    Discriminability: d' = |u2 - u1| / \sigma

Example
A radiologist examining a CT scan for evidence of a tumor faces a hard job, because there is always some uncertainty. There are four possible outcomes:
• hit (tumor present and doctor says "yes")
• miss (tumor present and doctor says "no")
• false alarm (tumor absent and doctor says "yes")
• correct rejection (tumor absent and doctor says "no")
Misses and false alarms are the two types of error. Signal detection theory was developed to help us understand how a continuous and ambiguous signal can lead to a binary yes/no decision.

The Four Cases

    Decision  | Signal (tumor) Absent (1)  | Signal (tumor) Present (2)
    No (1)    | P(1|1)  Correct Rejection  | P(1|2)  Miss
    Yes (2)   | P(2|1)  False Alarm        | P(2|2)  Hit

Discriminability
[Figure: the noise and noise-plus-signal distributions, separated by d' = |u2 - u1| / \sigma; the decision criterion x*, set by expectancy (decision bias), splits responses into "no" (1) and "yes" (2), with hit rate P(2|2) and false-alarm rate P(2|1).]

Signal Detection Theory
How do we find d' if we do not know u1, u2, or x*? From the data we can compute:
1. P(x > x* | \omega_2): a hit.
2. P(x > x* | \omega_1): a false alarm.
3. P(x < x* | \omega_2): a miss.
4. P(x < x* | \omega_1): a correct rejection.
If we plot the hit rate against the false-alarm rate, we obtain a point on the ROC (receiver operating characteristic) curve. With it we can distinguish discriminability from bias.
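The question above has a standard answer: with equal-variance normal distributions, d' can be recovered from the hit and false-alarm rates alone, as d' = z(P_H) - z(P_FA), where z is the inverse standard normal CDF. A sketch in which the numbers u1 = 0, u2 = 2, sigma = 1 and the criterion x* = 0.5 are assumed purely for illustration:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf                 # inverse standard normal CDF

def d_prime(p_hit, p_fa):
    """Estimate d' from observed hit and false-alarm rates alone."""
    return z(p_hit) - z(p_fa)

# Check against the definition d' = |u2 - u1| / sigma.
u1, u2, sigma, x_star = 0.0, 2.0, 1.0, 0.5
p_hit = 1 - NormalDist(u2, sigma).cdf(x_star)   # P(x > x* | signal)
p_fa = 1 - NormalDist(u1, sigma).cdf(x_star)    # P(x > x* | noise)
```

Here d_prime(p_hit, p_fa) recovers (u2 - u1) / sigma = 2 regardless of where x* sits, which is why the measure separates discriminability from bias.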
ROC Curve (Receiver Operating Characteristic)
The hit rate P_H = P(2|2) is plotted against the false-alarm rate P_FA = P(2|1) as the criterion x* is varied.

Neyman-Pearson Criterion
NP: maximize P_H subject to P_FA ≤ a.
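The constraint pins down the operating point directly: moving the criterion x* left raises both P_H and P_FA together, so the best admissible rule sets P_FA exactly equal to a. A sketch under assumed unit-variance normal noise and signal distributions (the means are illustrative):

```python
from statistics import NormalDist

noise = NormalDist(0.0, 1.0)     # assumed p(x | no signal)
signal = NormalDist(2.0, 1.0)    # assumed p(x | signal)

def neyman_pearson(a):
    """Maximize P_H subject to P_FA <= a by setting P_FA = a exactly."""
    x_star = noise.inv_cdf(1 - a)        # criterion with P(x > x* | noise) = a
    p_hit = 1 - signal.cdf(x_star)       # the hit rate this criterion attains
    return x_star, p_hit

# Sweeping a from 0 to 1 traces the ROC curve point by point.
roc = [neyman_pearson(a)[1] for a in (0.01, 0.05, 0.1, 0.5)]
```

Each (a, P_H) pair is one point on the ROC curve; the Neyman-Pearson rule simply picks the point whose false-alarm coordinate equals the allowed level a.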