EE 4780 Pattern Classification (Bahadir K. Gunturk)

Classification Example
Goal: Automatically classify incoming fish according to species and send them to the respective packing plants.
Features: Length, width, color, brightness, etc.
Model: Sea bass have some typical length, and it is greater than that of salmon.
Classifier: If the fish is longer than a threshold value l*, classify it as sea bass; otherwise classify it as salmon.
Training samples: To choose l*, make length measurements from training samples and inspect the results.

Length alone does not separate the two species well, so we now use two features to classify the fish: the lightness x1 and the width x2. The feature vector is x = [x1 x2]'. The feature extractor reduces the image of a fish to a feature vector x in a two-dimensional feature space, and the classifier places a decision boundary between the two classes in that space.

Feature Extraction
The goal of the feature extractor is to characterize an object to be recognized by measurements whose values are very similar for objects in the same category and very different for objects in different categories. The features should be invariant to irrelevant transformations of the input. For example, the location of a fish on the conveyor belt is irrelevant, so the representation should be insensitive to the location of the fish.

Classification
The task of the classifier is to use the feature vectors provided by the feature extractor to assign the object to a category. Perfect classification is often impossible; a more general task is to determine the probability of each of the possible categories. The process of using data to determine the classifier is referred to as training the classifier.

Classical Model
Raw data -> feature extractor -> feature vector (x1, x2, ..., xd) -> classifier -> class 1, 2, ..., or c.
We measure a fixed set of d features for an object that we want to classify. For example,
  x1 = height
  x2 = perimeter
  ...
  xd = average pixel intensity

Feature Vectors
We can think of our feature set as a feature vector x, where x is the d-dimensional column vector x = [x1 x2 ... xd]'. We can think of x as a point in a d-dimensional feature space. By this process of feature measurement, we represent an object as a point in feature space.

What is ahead
  Template matching
  Minimum-distance classifiers: metrics, inner products, linear discriminants
  Bayesian approach

Template Matching
To classify one of the noisy characters, simply compare it to the 'templates' of the classes. The comparison can be done in many ways; here are two:
  Count the number of places where the template and the pattern agree. Pick the class that has the maximum number of agreements.
  Count the number of places where the template and the pattern disagree. Pick the class that has the smallest number of disagreements.
This may not work well when there is rotation, scaling, warping, occlusion, etc. The most popular way to compare an input pattern f with a template g is a correlation-type measure. Question: how can we achieve rotation invariance?
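Where the slides describe counting agreements between a noisy pattern and class templates, the following is a minimal sketch, assuming small binary images stored as NumPy arrays; the toy templates and the classify_by_template helper are illustrative, not from the slides.

```python
# Minimal sketch of template matching by agreement counting.
import numpy as np

def classify_by_template(pattern, templates):
    """Return the index of the template with the most agreeing pixels."""
    agreements = [np.sum(pattern == t) for t in templates]
    return int(np.argmax(agreements))

# Toy 3x3 binary "characters": template 0 is a vertical bar, template 1 a horizontal bar.
templates = [
    np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]]),
]
noisy = np.array([[0, 1, 0], [0, 1, 1], [0, 1, 0]])  # noisy vertical bar
print(classify_by_template(noisy, templates))        # picks the vertical-bar class
```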
Minimum Distance Classifiers
Template matching can be expressed mathematically through a notion of distance. Let x be the feature vector for the unknown input, and let m_1, m_2, ..., m_c be templates (i.e., perfect, noise-free feature vectors) for the c classes. The error in matching x against m_k is given by $\| x - m_k \|$. Choose the class for which this error is a minimum. Since $\| x - m_k \|$ is the distance from x to m_k, the technique is called minimum distance classification.

The classifier computes the distance from x to each of the templates m_1, ..., m_c, and a minimum selector picks the class with the smallest distance. With $a = x - m_k$, the distance can be measured, for example, by the Euclidean norm
$$\| a \| = \left( a_1^2 + a_2^2 + \cdots + a_d^2 \right)^{1/2}$$
or by the sum of absolute values
$$|a_1| + |a_2| + \cdots + |a_d|.$$

Euclidean Distance
x is a column vector of d features x_1, x_2, ..., x_d. Using the transpose operator ' we can convert the column vector x to the row vector x':
$$x' = [x_1, x_2, \ldots, x_d].$$
The inner product of two column vectors x and y is defined by
$$x' y = x_1 y_1 + x_2 y_2 + \cdots + x_d y_d = \sum_{k=1}^{d} x_k y_k.$$
Thus the norm of x (using the Euclidean metric) is given by
$$\| x \| = \sqrt{x' x}.$$

Inner Products
Important additional properties of inner products:
  $x' y = y' x = \| x \| \, \| y \| \cos(\text{angle between } x \text{ and } y)$
  $x'(y + z) = x' y + x' z$
The inner product of x and y is maximum when the angle between them is zero, i.e., when one is just a positive multiple of the other. Sometimes we say that x'y is the correlation between x and y, and that the correlation is maximum when x and y point in the same direction. If x'y = 0, the vectors x and y are said to be orthogonal or uncorrelated.

Minimum Distance Classifiers
Example: Let m_1 = [4.3 1.3]' and m_2 = [1.5 0.3]'. Find the decision boundary. (The boundary is the set of points equidistant from the two templates, i.e., the perpendicular bisector of the segment joining m_1 and m_2.)

Linear Discriminants
For the minimum distance classifier, we choose the nearest class. Use the inner product to express the squared Euclidean distance from x to m_k:
$$\| x - m_k \|^2 = (x - m_k)'(x - m_k) = x'x - m_k'x - x'm_k + m_k'm_k = -2\left[ m_k'x - \tfrac{1}{2} m_k'm_k \right] + x'x.$$
The term x'x is the same for every class, so to find the template m_k that minimizes $\| x - m_k \|$ it is sufficient to find the m_k that maximizes the bracketed term. Define the linear discriminant function
$$g_k(x) = m_k'x - \tfrac{1}{2} \| m_k \|^2.$$

Minimum Euclidean Distance Classifier
A minimum-Euclidean-distance classifier classifies an input feature vector x by computing the c linear discriminant functions g_1(x), g_2(x), ..., g_c(x) and assigning x to the class corresponding to the maximum discriminant function.

Feature Scaling
The numerical value of a feature x depends on the units used, i.e., on the scale. If x is multiplied by a scale factor a, both the mean and the standard deviation are multiplied by a, and the variance is multiplied by a^2. Sometimes it is desirable to scale the data so that the resulting standard deviation is unity: divide x by the standard deviation s. Similarly, in measuring the distance from x to m, it often makes sense to measure it relative to the standard deviation.

This suggests an important generalization of the minimum-Euclidean-distance classifier. Let x_i be the value of Feature i, let m_ij be the mean value of Feature i for Class j, and let s_ij be the standard deviation of Feature i for Class j. In measuring the distance between the feature vector x and the mean vector m_j for Class j, use the standardized distance
$$r(x, m_j)^2 = \left( \frac{x_1 - m_{1j}}{s_{1j}} \right)^2 + \left( \frac{x_2 - m_{2j}}{s_{2j}} \right)^2 + \cdots + \left( \frac{x_d - m_{dj}}{s_{dj}} \right)^2.$$
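As a concrete illustration of the standardized distance above, here is a minimal sketch, assuming the per-class means and standard deviations have already been estimated; the numbers and function names are illustrative, not from the slides.

```python
# Minimal sketch: classify x by the standardized distance to each class mean.
import numpy as np

def standardized_distance_sq(x, mean, std):
    """r^2 = sum_i ((x_i - m_i) / s_i)^2 for one class."""
    return np.sum(((x - mean) / std) ** 2)

def classify(x, means, stds):
    """Assign x to the class with the smallest standardized distance."""
    dists = [standardized_distance_sq(x, m, s) for m, s in zip(means, stds)]
    return int(np.argmin(dists))

# Illustrative per-class statistics for two classes and two features.
means = [np.array([4.3, 1.3]), np.array([1.5, 0.3])]
stds  = [np.array([1.0, 0.5]), np.array([0.5, 0.2])]
print(classify(np.array([3.0, 1.0]), means, stds))
```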
Covariance
The covariance of two features measures their tendency to vary together, i.e., to co-vary. The variance is the average of the squared deviations of a feature from its mean; the covariance is the average of the products of the deviations of the two features from their means. Consider Feature i and Feature j. Let {x(1,i), x(2,i), ..., x(n,i)} be a set of n examples of Feature i, and let {x(1,j), x(2,j), ..., x(n,j)} be the corresponding set of n examples of Feature j.

Variance
Let m(i) be the mean of Feature i. Then the variance of Feature i is
$$s(i)^2 = \frac{[x(1,i) - m(i)]^2 + \cdots + [x(n,i) - m(i)]^2}{n - 1},$$
and s(i) is the standard deviation of Feature i.

Covariance
Let m(i) be the mean of Feature i and m(j) be the mean of Feature j. Then the covariance of Feature i and Feature j is defined by
$$c(i,j) = \frac{[x(1,i) - m(i)][x(1,j) - m(j)] + \cdots + [x(n,i) - m(i)][x(n,j) - m(j)]}{n - 1}.$$
The covariance has several important properties:
  If Feature i and Feature j tend to increase together, then c(i,j) > 0.
  If Feature i tends to decrease when Feature j increases, then c(i,j) < 0.
  If Feature i and Feature j are independent, then c(i,j) = 0.
  |c(i,j)| <= s(i) s(j), where s(i) is the standard deviation of Feature i.
  c(i,i) = s(i)^2, the variance of Feature i.

Covariance Matrix
All of the covariances c(i,j) can be collected together into a d x d covariance matrix C:
$$C = \begin{bmatrix} c(1,1) & c(1,2) & \cdots & c(1,d) \\ c(2,1) & c(2,2) & \cdots & c(2,d) \\ \vdots & \vdots & & \vdots \\ c(d,1) & c(d,2) & \cdots & c(d,d) \end{bmatrix}$$

We need to normalize the distance. Recall what we did earlier to get a standardized distance for a single feature:
$$r^2 = \left( \frac{x - m}{s} \right)^2 = (x - m)\,\frac{1}{s^2}\,(x - m).$$
The matrix generalization of this scalar equation is
$$r^2 = (x - m_x)^T C_x^{-1} (x - m_x).$$

Bayesian Decision Theory
Return to the fish example. There are two categories; denote them w1 for sea bass and w2 for salmon. Assume that there is some prior probability (or simply prior) P(w1) that the next fish is a sea bass, and some prior probability P(w2) that it is a salmon. Suppose that we make a decision without making a measurement. The logical decision rule is: decide w1 if P(w1) > P(w2); otherwise decide w2.

Suppose that we have a feature vector x; now the decision rule is: decide w1 if P(w1 | x) > P(w2 | x); otherwise decide w2. Using the Bayes formula,
$$P(w_i \mid x) = \frac{p(x \mid w_i)\,P(w_i)}{p(x)}, \qquad \text{where } p(x) = \sum_i p(x \mid w_i)\,P(w_i).$$

Define a set of discriminant functions g_i(x), i = 1, ..., c. Since p(x) is the same for all classes, we can use
$$g_i(x) = p(x \mid w_i)\,P(w_i) \qquad \text{or} \qquad g_i(x) = \ln p(x \mid w_i) + \ln P(w_i).$$

Gaussian Density
Univariate:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right]$$
Multivariate:
$$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left[ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right]$$

Example
Suppose there are two classes, w1 and w2, and the classification decision is made based on a feature measurement x. The conditional densities are Gaussian, written N(mean, variance):
  p(x | w1) ~ N(1, 1)
  p(x | w2) ~ N(5, 4)
The prior probabilities are P(w1) = 0.2 and P(w2) = 0.8.
(a) What is the class of an object if its feature is x = 2?
(b) Find the decision boundary when P(w1) = P(w2) = 0.5.
(c) Find the decision boundary when P(w1) = 0.2 and P(w2) = 0.8.
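A minimal numerical sketch of part (a), assuming the densities and priors given above; it simply compares p(x | w_i) P(w_i) for the two classes at x = 2 (the helper name is illustrative).

```python
# Minimal sketch: evaluate p(x|w_i) * P(w_i) for the example above at x = 2.
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density N(mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

x = 2.0
priors = {"w1": 0.2, "w2": 0.8}
params = {"w1": (1.0, 1.0), "w2": (5.0, 4.0)}  # (mean, variance)

scores = {w: gaussian_pdf(x, m, v) * priors[w] for w, (m, v) in params.items()}
decision = max(scores, key=scores.get)          # class with the larger posterior numerator
print(scores, "->", decision)
```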
Gaussian Density
For p(x) ~ N(μ, Σ), the center of the cluster is determined by the mean vector, and the shape of the cluster is determined by the covariance matrix. The quantity
$$r^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)$$
is the "Mahalanobis distance" from x to the mean.

Discriminant Functions for Gaussian Densities
Let us examine the discriminant function for p(x | w_i) ~ N(μ_i, Σ_i):
$$g_i(x) = \ln p(x \mid w_i) + \ln P(w_i) = \ln\!\left[ \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right) \right] + \ln P(w_i)$$
$$g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(w_i)$$

Case I: $\Sigma_i = \sigma^2 I$, so $\Sigma_i^{-1} = (1/\sigma^2) I$, and (dropping the terms that are the same for every class) the discriminant reduces to
$$g_i(x) = -\frac{1}{2\sigma^2} (x - \mu_i)^T (x - \mu_i) + \ln P(w_i).$$
As the priors change, the decision boundaries shift.

Examples: Find the decision boundaries for 1D and 2D Gaussian data. Using the general discriminant above, solve $g_1(x) = g_2(x)$ for x.

Parameter Estimation
We learned how we could design an optimal classifier if we knew the prior probabilities P(w_i) and the class-conditional densities p(x | w_i). In a typical application we rarely have such complete knowledge; we typically have some general knowledge and a number of design samples (training data). We use the samples to estimate the unknown probabilities and probability densities, and then use these estimates as if they were the true values. If the densities can be parameterized, the problem is simplified significantly. (For example, for a Gaussian distribution the mean and covariance matrix are the only parameters we need to estimate.)

Gaussian case:
$$\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k, \qquad \hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^T$$

Dimensionality
The accuracy degrades when the dimensionality is large. The dimensionality can be reduced by combining features; linear combinations are attractive because they are simple to compute and analytically tractable. Dimensionality reduction techniques include Principal Component Analysis and Fisher's Discriminant Analysis.

Principal Component Analysis (PCA)
Find a lower-dimensional space that best represents the data in a least-squares sense (for example, projecting a full N-dimensional space, here N = 2, onto a d-dimensional subspace, here d = 1).

We begin by considering the problem of representing the N-dimensional vectors x_1, x_2, ..., x_n by a single vector x_0. To be more specific, suppose that we want to find a vector x_0 such that the sum of squared differences between x_0 and the x_k is as small as possible. Define the cost function to be minimized:
$$J_0(x_0) = \sum_{k=1}^{n} \| x_0 - x_k \|^2.$$
The solution is the sample mean:
$$x_0 = \mu = \frac{1}{n} \sum_{k=1}^{n} x_k.$$

The sample mean does not reveal any of the variability in the data. Let us now consider a solution of the form
$$x_k \approx \mu + a_k e,$$
where a_k is a scalar and e is a unit vector. Define the cost function to be minimized:
$$J_1(a_1, \ldots, a_n, e) = \sum_{k=1}^{n} \| \mu + a_k e - x_k \|^2.$$
The solution is
$$a_k = e^T (x_k - \mu).$$

What is the best direction e for the line? Substituting $a_k = e^T (x_k - \mu)$ into J_1 gives
$$J_1(e) = -\,e^T S e + \sum_{k=1}^{n} \| x_k - \mu \|^2, \qquad \text{where } S = \sum_{k=1}^{n} (x_k - \mu)(x_k - \mu)^T.$$
So we must find the e that maximizes $e^T S e$ subject to $e^T e = 1$. The solution satisfies
$$S e = \lambda e.$$
Since $e^T S e = \lambda\, e^T e = \lambda$, we select the eigenvector corresponding to the largest eigenvalue.
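A minimal sketch of finding the first principal component as derived above, assuming the data samples are rows of a NumPy array; the sample data and function name are illustrative, not from the slides.

```python
# Minimal sketch: first principal component from the scatter matrix S.
import numpy as np

def first_principal_component(X):
    """X: n x d data matrix. Returns (mean, unit eigenvector of the largest eigenvalue of S)."""
    mu = X.mean(axis=0)
    centered = X - mu
    S = centered.T @ centered                  # scatter matrix S = sum (x_k - mu)(x_k - mu)^T
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh: S is symmetric
    e = eigvecs[:, np.argmax(eigvals)]         # eigenvector with the largest eigenvalue
    return mu, e

# Illustrative 2D data lying roughly along a line.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) @ np.array([[2.0, 1.0]]) + rng.normal(scale=0.1, size=(100, 2))
mu, e = first_principal_component(X)
a = (X - mu) @ e                               # projections a_k = e^T (x_k - mu)
print(e, a[:3])
```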
Generalizing to d dimensions (d <= n), find the eigenvectors e_1, e_2, ..., e_d corresponding to the d largest eigenvalues of S. Then
$$a_i = e_i^T (x_k - \mu), \quad i = 1, \ldots, d, \qquad \text{and} \qquad x_k \approx \mu + \sum_{i=1}^{d} a_i e_i.$$

Face Recognition
Given a probe image, decide which of the known faces in the training set it belongs to.

Eigenface Approach
Reduce the dimensionality by applying PCA:
  Apply PCA to a training dataset to find the first d principal components (d = 8).
  Find the weights a_1, ..., a_8 for all images.
  Classify the probe using the norm distance between its weights and those of the training images (see the sketch below).
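A minimal sketch of the eigenface pipeline described above, assuming face images are already vectorized into rows of a labeled NumPy array; the dataset, the choice d = 8, and the function names are illustrative, not from the slides.

```python
# Minimal sketch: eigenface-style recognition with PCA + nearest neighbor in weight space.
import numpy as np

def fit_eigenfaces(X, d=8):
    """X: n x D matrix of vectorized training faces. Returns (mean, D x d eigenvector basis)."""
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu)                      # scatter matrix
    eigvals, eigvecs = np.linalg.eigh(S)
    E = eigvecs[:, np.argsort(eigvals)[::-1][:d]]  # d eigenvectors with the largest eigenvalues
    return mu, E

def project(X, mu, E):
    """Weights a = E^T (x - mu) for each row of X."""
    return (X - mu) @ E

def classify_probe(probe, X_train, labels, mu, E):
    """Nearest neighbor in weight space using the Euclidean norm."""
    W_train = project(X_train, mu, E)
    w_probe = project(probe[None, :], mu, E)[0]
    nearest = np.argmin(np.linalg.norm(W_train - w_probe, axis=1))
    return labels[nearest]

# Illustrative random "faces": 20 training images of dimension 64, one label per image.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(20, 64))
labels = np.arange(20)
mu, E = fit_eigenfaces(X_train, d=8)
print(classify_probe(X_train[3] + 0.01 * rng.normal(size=64), X_train, labels, mu, E))
```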