ECES 481/681 – Statistical Pattern Recognition
Andrew Cohen, Drexel University
Lecture 1 – Bayesian classification

Today
Syllabus: ground rules, prerequisites, grading, projects, schedule. Attendance is mandatory.
Lecture 1: Theodoridis and Koutroumbas.
Course slides adapted from the textbook by Sergios Theodoridis and Konstantinos Koutroumbas, Version 3.

PATTERN RECOGNITION
Typical application areas: machine vision, character recognition (OCR), computer-aided diagnosis, speech recognition, face recognition, biometrics, image database retrieval, data mining, bioinformatics.
The task: assign unknown objects – patterns – to the correct class. This is known as classification.

What is pattern recognition?
Pattern recognition draws on data mining, machine learning, computer vision, voice and image processing, neural networks and fuzzy logic, information theory, (applied) probability and statistics, artificial intelligence, and many more.

An example: retinal stem cell time sequence of feature vectors (MATLAB structure):
DatasetName: 'DCC43 XY point 51 ; 11'
RawFeatures: [363x8 double]
TimeStamp: [363x1 double]
idxTrue: 2
idxCell: 1
Features: [362x6 double]
FeatureLabels: {'[dxy]' 'tot_dist' 'phi_dist' 'e' 's'}

Features: measurable quantities obtained from the patterns; the classification task is based on their respective values.
Feature vectors: a number of features $x_1, \ldots, x_l$ constitute the feature vector $x = [x_1, \ldots, x_l]^T \in R^l$.
Feature vectors are treated as random vectors.

The classifier consists of a set of functions whose values, computed at $x$, determine the class to which the corresponding pattern belongs.
Classification system overview:
patterns → sensor → feature generation → feature selection → classifier design → system evaluation

Supervised – unsupervised – semisupervised pattern recognition. The major directions of learning are:
Supervised: patterns whose class is known a priori are used for training.
Unsupervised: the number of classes/groups is (in general) unknown and no training patterns are available.
Semisupervised: a mixed set of patterns is available; for some of them the corresponding class is known, for the rest it is not.

Chapter 2 – CLASSIFIERS BASED ON BAYES DECISION THEORY
Statistical nature of feature vectors: $x = [x_1, x_2, \ldots, x_l]^T$.
Assign the pattern represented by feature vector $x$ to the most probable of the available classes $\omega_1, \omega_2, \ldots, \omega_M$; that is,
$x \rightarrow \omega_i : P(\omega_i \mid x)$ is maximum.

Computation of a-posteriori probabilities. Assume known:
the a-priori probabilities $P(\omega_1), P(\omega_2), \ldots, P(\omega_M)$;
the class-conditional densities $p(x \mid \omega_i), \ i = 1, 2, \ldots, M$, also known as the likelihood of $x$ with respect to $\omega_i$.

The Bayes rule (M = 2):
$P(\omega_i \mid x) = \dfrac{p(x \mid \omega_i) P(\omega_i)}{p(x)}$, where $p(x) = \sum_{i=1}^{2} p(x \mid \omega_i) P(\omega_i)$.

The Bayes classification rule (for two classes, M = 2). Given $x$, classify it according to the rule:
If $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, then $x \rightarrow \omega_1$.
If $P(\omega_2 \mid x) > P(\omega_1 \mid x)$, then $x \rightarrow \omega_2$.
Equivalently: classify $x$ according to $p(x \mid \omega_1) P(\omega_1) \gtrless p(x \mid \omega_2) P(\omega_2)$.
For equiprobable classes the test becomes $p(x \mid \omega_1) \gtrless p(x \mid \omega_2)$.

[Figure: two class-conditional pdfs with the decision regions $R_1$ ($\rightarrow \omega_1$) and $R_2$ ($\rightarrow \omega_2$).]

Equivalently, in words: divide the space into two regions.
If $x \in R_1$, decide $\omega_1$; if $x \in R_2$, decide $\omega_2$.
Probability of error (the total shaded area in the figure, for equiprobable classes):
$P_e = \dfrac{1}{2} \int_{-\infty}^{x_0} p(x \mid \omega_2)\,dx + \dfrac{1}{2} \int_{x_0}^{+\infty} p(x \mid \omega_1)\,dx$
The Bayesian classifier is OPTIMAL with respect to minimising the classification error probability.
Indeed: moving the threshold away from $x_0$ increases the total shaded area by the extra "grey" area.
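For concreteness, here is a minimal sketch of the two-class Bayes rule for one-dimensional Gaussian class-conditional densities with equal priors. The class parameters and the grid used to approximate $P_e$ are illustrative assumptions, not values fixed by the slides.

```python
import numpy as np

# Illustrative two-class problem: p(x|w1) = N(0, 1), p(x|w2) = N(1, 1), equal priors.
# These parameters are assumptions for the sketch, not values from the lecture.
priors = np.array([0.5, 0.5])
means = np.array([0.0, 1.0])
stds = np.array([1.0, 1.0])

def gauss_pdf(x, mu, sigma):
    """One-dimensional Gaussian density N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def posteriors(x):
    """Bayes rule: P(w_i|x) = p(x|w_i) P(w_i) / p(x)."""
    joint = gauss_pdf(x, means, stds) * priors     # p(x|w_i) P(w_i)
    return joint / joint.sum()                     # normalize by the evidence p(x)

def classify(x):
    """Bayes classification rule: pick the class with the largest posterior."""
    return int(np.argmax(posteriors(x))) + 1       # 1-based class label

# Approximate the error probability P_e by integrating the smaller of the two joints.
xs = np.linspace(-6.0, 7.0, 20001)
joints = gauss_pdf(xs[:, None], means, stds) * priors
p_error = np.trapz(joints.min(axis=1), xs)

print(classify(0.2), classify(0.9))   # -> 1 2 (the decision threshold sits at x0 = 0.5 here)
print(round(p_error, 4))              # ~0.309 for these assumed parameters
```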
The Bayes classification rule for many (M > 2) classes. Given $x$, classify it to $\omega_i$ if
$P(\omega_i \mid x) > P(\omega_j \mid x) \quad \forall j \neq i$
Such a choice also minimizes the classification error probability.

Minimizing the average risk. For each wrong decision a penalty term is assigned, since some decisions are more sensitive than others.
For M = 2, define the loss matrix
$L = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}$
$\lambda_{12}$ is the penalty term for deciding class $\omega_2$ although the pattern belongs to $\omega_1$, etc.
Risk with respect to $\omega_1$:
$r_1 = \lambda_{11} \int_{R_1} p(x \mid \omega_1)\,dx + \lambda_{12} \int_{R_2} p(x \mid \omega_1)\,dx$
Risk with respect to $\omega_2$:
$r_2 = \lambda_{21} \int_{R_1} p(x \mid \omega_2)\,dx + \lambda_{22} \int_{R_2} p(x \mid \omega_2)\,dx$
These are probabilities of wrong decisions, weighted by the penalty terms.
Average risk:
$r = r_1 P(\omega_1) + r_2 P(\omega_2)$

Choose $R_1$ and $R_2$ so that $r$ is minimized. Then assign $x$ to $\omega_1$ if
$\ell_1 \equiv \lambda_{11}\, p(x \mid \omega_1) P(\omega_1) + \lambda_{21}\, p(x \mid \omega_2) P(\omega_2) < \ell_2 \equiv \lambda_{12}\, p(x \mid \omega_1) P(\omega_1) + \lambda_{22}\, p(x \mid \omega_2) P(\omega_2)$
Equivalently: assign $x$ to $\omega_1$ ($\omega_2$) if
$\ell_{12} \equiv \dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > (<) \ \dfrac{P(\omega_2)}{P(\omega_1)} \cdot \dfrac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$
$\ell_{12}$ is the likelihood ratio.

If $P(\omega_1) = P(\omega_2) = \dfrac{1}{2}$ and $\lambda_{11} = \lambda_{22} = 0$:
$x \rightarrow \omega_1$ if $p(x \mid \omega_1) > \dfrac{\lambda_{21}}{\lambda_{12}}\, p(x \mid \omega_2)$
$x \rightarrow \omega_2$ if $p(x \mid \omega_2) > \dfrac{\lambda_{12}}{\lambda_{21}}\, p(x \mid \omega_1)$
If $\lambda_{21} = \lambda_{12}$, this is the minimum classification error probability rule.

An example:
$p(x \mid \omega_1) = \dfrac{1}{\sqrt{\pi}} \exp(-x^2)$, $p(x \mid \omega_2) = \dfrac{1}{\sqrt{\pi}} \exp(-(x-1)^2)$, $P(\omega_1) = P(\omega_2) = \dfrac{1}{2}$, $L = \begin{pmatrix} 0 & 0.5 \\ 1.0 & 0 \end{pmatrix}$
Then the threshold $x_0$ for minimum $P_e$ is:
$x_0: \ \exp(-x^2) = \exp(-(x-1)^2) \Rightarrow x_0 = \dfrac{1}{2}$
The threshold $\hat{x}_0$ for minimum $r$ is:
$\hat{x}_0: \ \exp(-x^2) = 2\exp(-(x-1)^2) \Rightarrow \hat{x}_0 = \dfrac{1 - \ln 2}{2} < \dfrac{1}{2}$
Thus $\hat{x}_0$ moves to the left of $x_0 = \dfrac{1}{2}$ (WHY?)

DISCRIMINANT FUNCTIONS AND DECISION SURFACES
If $R_i$, $R_j$ are contiguous, then
$g(x) \equiv P(\omega_i \mid x) - P(\omega_j \mid x) = 0$
is the surface separating the regions $R_i: P(\omega_i \mid x) > P(\omega_j \mid x)$ and $R_j: P(\omega_j \mid x) > P(\omega_i \mid x)$. On the one side it is positive (+), on the other negative (−). It is known as the decision surface.

If $f(\cdot)$ is monotonically increasing, the rule remains the same if we use:
$x \rightarrow \omega_i$ if $f(P(\omega_i \mid x)) > f(P(\omega_j \mid x)) \quad \forall j \neq i$
$g_i(x) \equiv f(P(\omega_i \mid x))$ is a discriminant function.
In general, discriminant functions can be defined independently of the Bayesian rule. They lead to suboptimal solutions, yet, if chosen appropriately, they can be computationally more tractable. Moreover, in practice they may also lead to better solutions; this, for example, may be the case if the nature of the underlying pdfs is unknown.

THE GAUSSIAN DISTRIBUTION
The one-dimensional case:
$p(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right)$
where $\mu$ is the mean value, $\mu = E[x] = \int_{-\infty}^{+\infty} x\, p(x)\,dx$, and $\sigma^2$ is the variance, $\sigma^2 = E[(x - E[x])^2] = \int_{-\infty}^{+\infty} (x-\mu)^2 p(x)\,dx$.

The multivariate (multidimensional) case:
$p(x) = \dfrac{1}{(2\pi)^{l/2} |\Sigma|^{1/2}} \exp\!\left(-\dfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$
where $\mu = E[x]$ is the mean value and $\Sigma = E[(x-\mu)(x-\mu)^T]$ is the covariance matrix.
An example, the two-dimensional case:
$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \begin{pmatrix} E[x_1] \\ E[x_2] \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$
where $\sigma_{12} = E[(x_1 - \mu_1)(x_2 - \mu_2)]$.

BAYESIAN CLASSIFIER FOR NORMAL DISTRIBUTIONS
Multivariate Gaussian pdf:
$p(x \mid \omega_i) = \dfrac{1}{(2\pi)^{l/2} |\Sigma_i|^{1/2}} \exp\!\left(-\dfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right)$
$\mu_i = E[x \mid \omega_i]$ is an $l \times 1$ vector and $\Sigma_i = E[(x-\mu_i)(x-\mu_i)^T]$ is the $l \times l$ covariance matrix.

$\ln(\cdot)$ is monotonic. Define:
$g_i(x) \equiv \ln(p(x \mid \omega_i) P(\omega_i)) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$
$g_i(x) = -\dfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) + \ln P(\omega_i) + C_i$, where $C_i = -\dfrac{l}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma_i|$
Example: $\Sigma_i = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{pmatrix}$
$g_i(x) = -\dfrac{1}{2\sigma^2}(x_1^2 + x_2^2) + \dfrac{1}{\sigma^2}(\mu_{i1} x_1 + \mu_{i2} x_2) - \dfrac{1}{2\sigma^2}(\mu_{i1}^2 + \mu_{i2}^2) + \ln P(\omega_i) + C_i$
That is, $g_i(x)$ is quadratic and the surfaces $g_i(x) - g_j(x) = 0$ are quadrics: ellipsoids, parabolas, hyperbolas, pairs of lines.

[Figures: Example 1 and Example 2 of quadratic decision curves for Gaussian classes.]
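A minimal sketch of the Gaussian discriminant $g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i) + \ln P(\omega_i) + C_i$ and of classification by the largest discriminant. The two class means, covariances, priors, and the test point are illustrative assumptions, not values from the slides.

```python
import numpy as np

def gaussian_discriminant(x, mean, cov, prior):
    """g_i(x) = -0.5 (x - mu_i)^T Sigma_i^{-1} (x - mu_i) + ln P(w_i) + C_i."""
    l = len(mean)
    diff = x - mean
    quad = -0.5 * diff @ np.linalg.inv(cov) @ diff
    C_i = -0.5 * l * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(cov))
    return quad + np.log(prior) + C_i

# Illustrative two-class problem (assumed parameters).
classes = [
    {"mean": np.array([0.0, 0.0]), "cov": np.array([[1.0, 0.0], [0.0, 1.0]]), "prior": 0.5},
    {"mean": np.array([3.0, 3.0]), "cov": np.array([[1.0, 0.5], [0.5, 2.0]]), "prior": 0.5},
]

x = np.array([1.0, 2.2])
scores = [gaussian_discriminant(x, c["mean"], c["cov"], c["prior"]) for c in classes]
print(int(np.argmax(scores)) + 1)   # the class with the largest discriminant wins

# When all covariances are equal, the quadratic terms cancel in g_i - g_j and the
# decision surfaces become hyperplanes (the linear case discussed next).
```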
Decision Hyperplanes
The quadratic terms are $x^T \Sigma_i^{-1} x$. If ALL $\Sigma_i = \Sigma$ (the same), the quadratic terms are not of interest: they are not involved in the comparisons. Then, equivalently, we can write:
$g_i(x) = w_i^T x + w_{i0}$
$w_i = \Sigma^{-1} \mu_i$
$w_{i0} = \ln P(\omega_i) - \dfrac{1}{2} \mu_i^T \Sigma^{-1} \mu_i$
The discriminant functions are LINEAR.

Let, in addition, $\Sigma = \sigma^2 I$. Then:
$g_i(x) = \dfrac{1}{\sigma^2} \mu_i^T x + w_{i0}$
$g_{ij}(x) \equiv g_i(x) - g_j(x) = w^T (x - x_0) = 0$
$w = \mu_i - \mu_j$
$x_0 = \dfrac{1}{2}(\mu_i + \mu_j) - \sigma^2 \ln\dfrac{P(\omega_i)}{P(\omega_j)} \cdot \dfrac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|^2}$

Remark:
If $P(\omega_1) = P(\omega_2)$, then $x_0 = \dfrac{1}{2}(\mu_1 + \mu_2)$.
If $P(\omega_1) \neq P(\omega_2)$, the linear classifier moves towards the class with the smaller probability.

Nondiagonal $\Sigma$:
$g_{ij}(x) = w^T (x - x_0) = 0$
$w = \Sigma^{-1}(\mu_i - \mu_j)$
$x_0 = \dfrac{1}{2}(\mu_i + \mu_j) - \ln\dfrac{P(\omega_i)}{P(\omega_j)} \cdot \dfrac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|_{\Sigma^{-1}}^2}$
where $\|x\|_{\Sigma^{-1}} = (x^T \Sigma^{-1} x)^{1/2}$.
The decision hyperplane is not normal to $\mu_i - \mu_j$; it is normal to $\Sigma^{-1}(\mu_i - \mu_j)$.

Minimum Distance Classifiers
For equiprobable classes, $P(\omega_i) = \dfrac{1}{M}$, and
$g_i(x) = -\dfrac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i)$
For $\Sigma = \sigma^2 I$: assign $x \rightarrow \omega_i$ with the smaller Euclidean distance $d_E = \|x - \mu_i\|$.
For nondiagonal $\Sigma$: assign $x \rightarrow \omega_i$ with the smaller Mahalanobis distance $d_m = \left((x - \mu_i)^T \Sigma^{-1} (x - \mu_i)\right)^{1/2}$.

[Figure: curves of equal Euclidean and equal Mahalanobis distance from a class mean.]

Example: Given $\omega_1$, $\omega_2$ with $P(\omega_1) = P(\omega_2)$ and
$p(x \mid \omega_1) = N(\mu_1, \Sigma)$, $p(x \mid \omega_2) = N(\mu_2, \Sigma)$, $\mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, $\mu_2 = \begin{pmatrix} 3 \\ 3 \end{pmatrix}$, $\Sigma = \begin{pmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{pmatrix}$,
classify the vector $x = \begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix}$ using Bayesian classification.
$\Sigma^{-1} = \begin{pmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{pmatrix}$
Compute the Mahalanobis distances from $\mu_1$, $\mu_2$:
$d_{m,1}^2 = [1.0,\ 2.2]\, \Sigma^{-1} \begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix} = 2.952, \qquad d_{m,2}^2 = [-2.0,\ -0.8]\, \Sigma^{-1} \begin{pmatrix} -2.0 \\ -0.8 \end{pmatrix} = 3.672$
Classify $x \rightarrow \omega_1$. Observe that $d_{E,2} < d_{E,1}$.

ESTIMATION OF UNKNOWN PROBABILITY DENSITY FUNCTIONS
Maximum Likelihood
Let $x_1, x_2, \ldots, x_N$ be known and independent. Let $p(x)$ be known to within an unknown vector parameter $\theta$: $p(x) \equiv p(x; \theta)$. With $X = \{x_1, x_2, \ldots, x_N\}$,
$p(X; \theta) \equiv p(x_1, x_2, \ldots, x_N; \theta) = \prod_{k=1}^{N} p(x_k; \theta)$
which is known as the likelihood of $\theta$ with respect to $X$. The method:
$\hat{\theta}_{ML} = \arg\max_{\theta} \prod_{k=1}^{N} p(x_k; \theta)$
$L(\theta) \equiv \ln p(X; \theta) = \sum_{k=1}^{N} \ln p(x_k; \theta)$
$\hat{\theta}_{ML}: \quad \dfrac{\partial L(\theta)}{\partial \theta} = \sum_{k=1}^{N} \dfrac{1}{p(x_k; \theta)} \dfrac{\partial p(x_k; \theta)}{\partial \theta} = 0$

[Figure: log-likelihood as a function of $\theta$, with its maximum at the ML estimate.]

If, indeed, there is a $\theta_0$ such that $p(x) = p(x; \theta_0)$, then
$\lim_{N \to \infty} E[\hat{\theta}_{ML}] = \theta_0$ and $\lim_{N \to \infty} E[\|\hat{\theta}_{ML} - \theta_0\|^2] = 0$,
i.e. the ML estimator is asymptotically unbiased and consistent.

Estimation Bias and Consistency
The bias of an estimator is the difference between the estimator's expected value and the true value of the parameter being estimated: $\text{bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$.
An estimator is said to be consistent if it converges in probability to the true value of the parameter.

Example: $p(x): N(\mu, \Sigma)$ with $\mu$ unknown. Given $x_1, x_2, \ldots, x_N$, $p(x_k) \equiv p(x_k; \mu)$ with
$p(x_k; \mu) = \dfrac{1}{(2\pi)^{l/2} |\Sigma|^{1/2}} \exp\!\left(-\dfrac{1}{2}(x_k - \mu)^T \Sigma^{-1} (x_k - \mu)\right)$
$L(\mu) = \sum_{k=1}^{N} \ln p(x_k; \mu) = C - \dfrac{1}{2} \sum_{k=1}^{N} (x_k - \mu)^T \Sigma^{-1} (x_k - \mu)$
$\dfrac{\partial L(\mu)}{\partial \mu} = \sum_{k=1}^{N} \Sigma^{-1}(x_k - \mu) = 0 \ \Rightarrow \ \hat{\mu}_{ML} = \dfrac{1}{N} \sum_{k=1}^{N} x_k$
Remember: if $A = A^T$, then $\dfrac{\partial (x^T A x)}{\partial x} = 2Ax$.

Maximum A-Posteriori Probability Estimation
In the ML method, $\theta$ was considered a parameter. Here we look at $\theta$ as a random vector described by a pdf $p(\theta)$, assumed to be known. Given $X = \{x_1, x_2, \ldots, x_N\}$, compute the maximum of $p(\theta \mid X)$. From the Bayes theorem:
$p(\theta)\, p(X \mid \theta) = p(X)\, p(\theta \mid X)$, or $p(\theta \mid X) = \dfrac{p(\theta)\, p(X \mid \theta)}{p(X)}$
The method:
$\hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta \mid X)$, or $\hat{\theta}_{MAP}: \ \dfrac{\partial}{\partial \theta}\left(p(\theta)\, p(X \mid \theta)\right) = 0$
If $p(\theta)$ is uniform or broad enough, $\hat{\theta}_{MAP} \approx \hat{\theta}_{ML}$.

[Figure: a sharply peaked likelihood against a broad prior, for which MAP ≈ ML.]

Example: $p(x): N(\mu, \sigma^2 I)$ with $\mu$ unknown, $X = \{x_1, \ldots, x_N\}$,
$p(\mu) = \dfrac{1}{(2\pi)^{l/2} \sigma_\mu^l} \exp\!\left(-\dfrac{\|\mu - \mu_0\|^2}{2\sigma_\mu^2}\right)$
MAP: $\dfrac{\partial}{\partial \mu} \ln\!\left(\prod_{k=1}^{N} p(x_k \mid \mu)\, p(\mu)\right) = 0$, or $\sum_{k=1}^{N} \dfrac{1}{\sigma^2}(x_k - \hat{\mu}) - \dfrac{1}{\sigma_\mu^2}(\hat{\mu} - \mu_0) = 0$
$\hat{\mu}_{MAP} = \dfrac{\mu_0 + \dfrac{\sigma_\mu^2}{\sigma^2} \sum_{k=1}^{N} x_k}{1 + \dfrac{\sigma_\mu^2}{\sigma^2} N}$
For $\dfrac{\sigma_\mu^2}{\sigma^2} \gg 1$, or for $N \to \infty$: $\hat{\mu}_{MAP} \approx \hat{\mu}_{ML} = \dfrac{1}{N} \sum_{k=1}^{N} x_k$
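A minimal numerical sketch of the ML and MAP estimates of a Gaussian mean using the formulas above. The true mean, noise level, prior centre $\mu_0$, and variances are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D data: N samples from N(mu_true, sigma^2); all numbers are assumptions.
mu_true, sigma, N = 2.0, 1.0, 20
x = rng.normal(mu_true, sigma, size=N)

# ML estimate: the sample mean (from d/dmu of the log-likelihood set to zero).
mu_ml = x.mean()

# MAP estimate with a Gaussian prior N(mu_0, sigma_mu^2) on mu:
# mu_MAP = (mu_0 + (sigma_mu^2/sigma^2) * sum_k x_k) / (1 + (sigma_mu^2/sigma^2) * N)
mu_0, sigma_mu = 0.0, 0.5
ratio = sigma_mu**2 / sigma**2
mu_map = (mu_0 + ratio * x.sum()) / (1 + ratio * N)

print(round(mu_ml, 3), round(mu_map, 3))
# With a broad prior (large sigma_mu) or large N, mu_map approaches mu_ml, as the
# slides note; with a tight prior it is pulled towards mu_0.
```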
Nonparametric Estimation
Let $k_N$ be the number of points inside an interval of length $h$, out of $N$ points in total. Then
$\hat{p}(x) \approx \hat{p}(\hat{x}) = \dfrac{1}{h} \cdot \dfrac{k_N}{N}$, for $|x - \hat{x}| \leq \dfrac{h}{2}$
In words: place a segment of length $h$ at $\hat{x}$ and count the points inside it.
If $p(x)$ is continuous, $\hat{p}(x) \to p(x)$ as $N \to \infty$, provided that $h_N \to 0$, $k_N \to \infty$, $\dfrac{k_N}{N} \to 0$.

Parzen Windows
Place at $x$ a hypercube of side length $h$ and count the points inside it. Define
$\phi(x_i) = \begin{cases} 1 & \text{if } |x_{ij}| \leq \frac{1}{2}, \ j = 1, \ldots, l \\ 0 & \text{otherwise} \end{cases}$
That is, $\phi$ equals 1 inside a unit-side hypercube centered at 0. Then
$\hat{p}(x) = \dfrac{1}{h^l} \cdot \dfrac{1}{N} \sum_{i=1}^{N} \phi\!\left(\dfrac{x_i - x}{h}\right) = \dfrac{1}{N} \cdot \dfrac{1}{h^l} \cdot (\text{number of points inside an } h\text{-side hypercube centered at } x)$
The problem: $p(x)$ is continuous, but $\phi(\cdot)$ is discontinuous.
Parzen windows – kernels – potential functions: choose $\phi(x)$ smooth, with $\phi(x) \geq 0$ and $\int_x \phi(x)\,dx = 1$.

Mean value:
$E[\hat{p}(x)] = \dfrac{1}{h^l} \cdot \dfrac{1}{N} \sum_{i=1}^{N} E\!\left[\phi\!\left(\dfrac{x_i - x}{h}\right)\right] = \int_{x'} \dfrac{1}{h^l} \phi\!\left(\dfrac{x' - x}{h}\right) p(x')\,dx'$
As $h \to 0$, the width of $\phi\!\left(\dfrac{x' - x}{h}\right)$ goes to 0 and $\dfrac{1}{h^l} \phi\!\left(\dfrac{x' - x}{h}\right) \to \delta(x' - x)$, so
$E[\hat{p}(x)] \to \int_{x'} \delta(x' - x)\, p(x')\,dx' = p(x)$
Hence the estimate is unbiased in the limit.

Variance: the smaller the $h$, the higher the variance.
[Figures: Parzen estimates for h = 0.1, N = 1000; h = 0.8, N = 1000; h = 0.1, N = 10000. The higher the N, the better the accuracy.]

If $h \to 0$, $N \to \infty$ and $N h^l \to \infty$, the estimate is asymptotically unbiased.
The method: remember that
$\ell_{12} = \dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > (<) \ \dfrac{P(\omega_2)}{P(\omega_1)} \cdot \dfrac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$
Using the Parzen estimates of the two class-conditional pdfs, assign $x$ to $\omega_1$ ($\omega_2$) if
$\dfrac{\frac{1}{N_1 h^l} \sum_{i=1}^{N_1} \phi\!\left(\frac{x_i - x}{h}\right)}{\frac{1}{N_2 h^l} \sum_{i=1}^{N_2} \phi\!\left(\frac{x_i - x}{h}\right)} > (<) \ \dfrac{P(\omega_2)}{P(\omega_1)} \cdot \dfrac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$
where the sums run over the training points of $\omega_1$ and $\omega_2$, respectively.

CURSE OF DIMENSIONALITY
In all the methods so far, we saw that the higher the number of points, N, the better the resulting estimate. If in the one-dimensional space an interval filled with N points is adequate (for a good estimate), in the two-dimensional space the corresponding square will require N² points, and in the ℓ-dimensional space the ℓ-dimensional cube will require N^ℓ points. This exponential increase in the number of necessary points is known as the curse of dimensionality, and it is a major problem one is confronted with in high-dimensional spaces.

[Figure: an example illustrating the curse of dimensionality.]

NAIVE–BAYES CLASSIFIER
Let $x \in R^\ell$, and the goal is to estimate $p(x \mid \omega_i)$, $i = 1, 2, \ldots, M$. For a "good" estimate of the pdf one would need, say, $N^\ell$ points. Assume $x_1, x_2, \ldots, x_\ell$ to be mutually independent. Then:
$p(x \mid \omega_i) = \prod_{j=1}^{\ell} p(x_j \mid \omega_i)$
In this case, one would require roughly $N$ points for each one-dimensional pdf; thus, a number of points of the order of $N \cdot \ell$ would suffice. It turns out that the Naïve–Bayes classifier works reasonably well even in cases that violate the independence assumption.

K Nearest Neighbor Density Estimation
In Parzen windows, the volume is constant and the number of points falling inside it varies. Now: keep the number of points constant, $k_N = k$, and let the volume vary:
$\hat{p}(x) = \dfrac{k}{N V(x)}$
For classification, using the class-conditional estimates $\hat{p}(x \mid \omega_1) = \dfrac{k}{N_1 V_1}$ and $\hat{p}(x \mid \omega_2) = \dfrac{k}{N_2 V_2}$, the likelihood-ratio test becomes
$\ell_{12} \approx \dfrac{N_2 V_2}{N_1 V_1} > (<) \ \text{threshold}$

The Nearest Neighbor Rule
Choose $k$ out of the $N$ training vectors and identify the $k$ nearest ones to $x$. Out of these $k$, identify the number $k_i$ that belong to class $\omega_i$. Assign $x \rightarrow \omega_i$ if $k_i > k_j \ \forall j \neq i$.
The simplest version is k = 1. For large N this is not bad. It can be shown that, if $P_B$ is the optimal Bayesian error probability, then:
$P_B \leq P_{NN} \leq P_B \left(2 - \dfrac{M}{M-1} P_B\right) \leq 2 P_B$
$P_B \leq P_{kNN} \leq P_B + \sqrt{\dfrac{2 P_{NN}}{k}}$, with $P_{kNN} \to P_B$ as $k \to \infty$
For small $P_B$: $P_{NN} \approx 2 P_B$, $P_{3NN} \approx P_B + 3(P_B)^2$.

[Figure: an example of nearest neighbor decision regions.]

Voronoi tessellation
$R_i = \{x : d(x, x_i) < d(x, x_j), \ i \neq j\}$
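A minimal 1-D sketch contrasting the Parzen window estimate (fixed width h, variable count) with the k-nearest-neighbor estimate (fixed count k, variable volume). The sample, the Gaussian kernel choice, h, and k are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=500)      # illustrative 1-D training sample (assumed)

def parzen_estimate(x, samples, h):
    """p_hat(x) = (1/(N h)) * sum_i phi((x_i - x)/h), with a smooth Gaussian kernel phi."""
    u = (samples - x) / h
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # smooth, non-negative, integrates to 1
    return phi.sum() / (len(samples) * h)

def knn_estimate(x, samples, k):
    """p_hat(x) = k / (N * V(x)), where V(x) is the interval containing the k nearest points."""
    dists = np.sort(np.abs(samples - x))
    volume = 2 * dists[k - 1]                        # interval of radius = distance to k-th neighbor
    return k / (len(samples) * volume)

for x in (0.0, 1.0, 2.0):
    print(x, round(parzen_estimate(x, data, h=0.3), 3), round(knn_estimate(x, data, k=25), 3))
# Both estimates should roughly track the true N(0,1) density (about 0.399, 0.242, 0.054)
# for this sample; shrinking h or k lowers the bias but raises the variance.
```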
BAYESIAN NETWORKS
Bayes probability chain rule:
$p(x_1, x_2, \ldots, x_\ell) = p(x_\ell \mid x_{\ell-1}, \ldots, x_1) \cdot p(x_{\ell-1} \mid x_{\ell-2}, \ldots, x_1) \cdots p(x_2 \mid x_1) \cdot p(x_1)$
Assume now that the conditional dependence for each $x_i$ is limited to a subset of the features appearing in each of the product terms. That is,
$p(x_1, x_2, \ldots, x_\ell) = p(x_1) \prod_{i=2}^{\ell} p(x_i \mid A_i)$
where $A_i \subseteq \{x_{i-1}, x_{i-2}, \ldots, x_1\}$.
For example, if $\ell = 6$, we could assume $p(x_6 \mid x_5, \ldots, x_1) = p(x_6 \mid x_5, x_4)$; then $A_6 = \{x_5, x_4\} \subset \{x_5, \ldots, x_1\}$.
The above is a generalization of the Naïve–Bayes classifier, for which the assumption is $A_i = \emptyset$ for $i = 1, 2, \ldots, \ell$.

A graphical way to portray conditional dependencies is given below. According to this figure:
x6 is conditionally dependent on x4, x5;
x5 on x4;
x4 on x1, x2;
x3 on x2;
x1, x2 are conditionally independent of the other variables.
For this case:
$p(x_1, x_2, \ldots, x_6) = p(x_6 \mid x_5, x_4)\, p(x_5 \mid x_4)\, p(x_4 \mid x_1, x_2)\, p(x_3 \mid x_2)\, p(x_2)\, p(x_1)$

Bayesian Networks
Definition: A Bayesian network is a directed acyclic graph (DAG) where the nodes correspond to random variables. Each node is associated with a set of conditional probabilities (densities), $p(x_i \mid A_i)$, where $x_i$ is the variable associated with the node and $A_i$ is the set of its parents in the graph.
A Bayesian network is specified by:
the marginal probabilities of its root nodes;
the conditional probabilities of the non-root nodes, given their parents, for ALL possible values of the involved variables.

The figure below is an example of a Bayesian network corresponding to a paradigm from the medical applications field. It models conditional dependencies for an example concerning smokers (S), tendencies to develop cancer (C) and heart disease (H), together with variables corresponding to heart (H1, H2) and cancer (C1, C2) medical tests.

Once a DAG has been constructed, the joint probability can be obtained by multiplying the marginal (root node) and the conditional (non-root node) probabilities.
Training: Once a topology is given, the probabilities are estimated from the training data set. There are also methods that learn the topology itself.
Probability inference: This is the most common task that Bayesian networks help us solve efficiently. Given the values of some of the variables in the graph, known as the evidence, the goal is to compute the conditional probabilities of some of the other variables, given the evidence.

Example: Consider the Bayesian network of the figure:
a) If x is measured to be x = 1 (x1), compute P(w = 0 | x = 1) [P(w0|x1)].
b) If w is measured to be w = 1 (w1), compute P(x = 0 | w = 1) [P(x0|w1)].
For a), a set of calculations is required that propagates from node x to node w; it turns out that P(w0|x1) = 0.63. For b), the propagation is reversed in direction; it turns out that P(x0|w1) = 0.4. In general, the required inference information is computed via a combined process of "message passing" among the nodes of the DAG (a small enumeration sketch is given at the end of these notes).
Complexity: For singly connected graphs, message passing algorithms have a complexity linear in the number of nodes.

Bayesian networks and functional graphical models:
Pearl, J., Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

Homework
Due by the beginning of class, Tuesday 4/11/17:
Read chapters 1 & 2 of the text.
Do the text computer experiments 2.7 and 2.8, pp. 85–86.
Read sections 1 through 4 of "On The Mathematical Foundations Of Theoretical Statistics" (posted on the course website). What is a sufficient statistic for the Gaussian, Poisson, and exponential distributions? Express the mean and variance of these distributions in terms of the sufficient statistic.
Next: chapter 3.

Instructor Contact Information
Andrew R. Cohen
Associate Prof.
Department of Electrical and Computer Engineering
Drexel University
3120 – 40 Market St., Suite 110
Philadelphia, PA 19104
office phone: (215) 571 – 4358
http://bioimage.coe.drexel.edu/courses
[email protected]
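As referenced in the inference example above, here is a minimal sketch of exact inference by direct enumeration on a tiny two-node network x → w with binary variables. The prior P(x) and the conditional table P(w|x) are made-up placeholder numbers, not the values behind the 0.63 and 0.4 quoted from the lecture figure.

```python
# Toy Bayesian network x -> w with binary variables.
# CPT values below are illustrative placeholders, not the ones from the lecture figure.
P_x = {0: 0.6, 1: 0.4}                      # marginal of the root node x
P_w_given_x = {0: {0: 0.8, 1: 0.2},         # P(w | x = 0)
               1: {0: 0.3, 1: 0.7}}         # P(w | x = 1)

def joint(x, w):
    """Joint probability from the factorization p(x, w) = p(x) * p(w | x)."""
    return P_x[x] * P_w_given_x[x][w]

def infer_w_given_x(w, x):
    """Causal direction: P(w | x) is read directly from the conditional table."""
    return P_w_given_x[x][w]

def infer_x_given_w(x, w):
    """Diagnostic direction: P(x | w) by enumeration, i.e. Bayes theorem on the joint."""
    evidence = sum(joint(xv, w) for xv in P_x)           # p(w), summing out x
    return joint(x, w) / evidence

print(infer_w_given_x(w=0, x=1))             # P(w=0 | x=1) = 0.3 with these placeholder CPTs
print(round(infer_x_given_w(x=0, w=1), 3))   # P(x=0 | w=1) = 0.3 by enumeration here
```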