EE 4780
Pattern Classification
Classification Example
Goal: Automatically classify incoming fish according to species, and send them to the respective packing plants.
Features: Length, width, color, brightness, etc.
Model: Sea bass have some typical length, and it is greater than that for salmon.
Classifier: If the fish is longer than a value l*, classify it as sea bass.
Training Samples: To choose l*, make length measurements from training samples and inspect the results (see the sketch below).
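A minimal sketch (my illustration, not from the lecture) of such a threshold classifier in Python, assuming a handful of hypothetical training lengths and choosing l* as the midpoint between the class means:

```python
import numpy as np

# Hypothetical training lengths (illustrative values, not from the lecture)
salmon_lengths = np.array([2.0, 2.5, 3.1, 3.4, 4.0])
seabass_lengths = np.array([4.5, 5.0, 5.5, 6.1, 6.8])

# A simple choice of l*: the midpoint between the two class means
l_star = 0.5 * (salmon_lengths.mean() + seabass_lengths.mean())

def classify(length):
    """Decide 'sea bass' if the fish is longer than l*, else 'salmon'."""
    return "sea bass" if length > l_star else "salmon"

print(l_star, classify(5.2))
```

Any other rule for picking l* (for example, minimizing the number of training errors) fits the same template.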
Classification Example

[Figure slides: training measurements for the two classes and the resulting decision boundary.]
Now we have two features to classify the fish: the lightness x1 and the width x2. Feature vector: x = [x1 x2]'.

The feature extractor reduces the image of a fish to a feature vector x in a 2D feature space.
Classification Example

[Figure slides: feature-space plots for the two-feature fish example.]
Feature Extraction

The goal of the feature extractor is to characterize an object to be recognized by measurements whose values are very similar for objects in the same category, and very different for objects in different categories.

The features should be invariant to irrelevant transformations of the input. For example, the location of a fish on the belt is irrelevant, and thus the representation should be insensitive to the location of the fish.
Classification

The task of the classifier is to use the feature vectors (provided by the feature extractor) to assign the object to a category.

Perfect classification is often impossible; a more general task is to determine the probability of each of the possible categories.

The process of using data to determine the classifier is referred to as training the classifier.
Classical Model

Raw Data → Feature Extractor → (x1, x2, ..., xd) → Classifier → Class (1 or 2 or ... or c)

We measure a fixed set of d features for an object that we want to classify. For example:

x1 = height
x2 = perimeter
...
xd = average pixel intensity
Feature Vectors

We can think of our feature set as a feature vector x, where x is the d-dimensional column vector with components x1, x2, ..., xd.

[Figure: x drawn as a point in a 3-dimensional feature space with axes x1, x2, x3.]

We can think of x as a point in a d-dimensional feature space. By this process of feature measurement, we can represent an object as a point in feature space.
What is ahead

Template matching
Minimum-distance classifiers
Metrics
Inner products
Linear discriminants
Bayesian approach
Template Matching

[Figure: two character templates (left) and noisy test characters.]

To classify one of the noisy characters, simply compare it to the two 'templates' on the left.

Comparison can be done in many ways; here are two (see the sketch below):

Count the number of places where the template and pattern agree. Pick the class that has the maximum number of agreements.
Count the number of places where the template and pattern disagree. Pick the class that has the smallest number of disagreements.

This may not work well when there is rotation, scaling, warping, occlusion, etc.
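A minimal sketch (my illustration, not from the slides) of the agreement-counting comparison, assuming the characters are binary images stored as equal-size 0/1 NumPy arrays:

```python
import numpy as np

def match_by_agreement(pattern, templates):
    """Return the index of the template that agrees with the pattern
    in the largest number of pixel positions (binary images)."""
    agreements = [np.sum(pattern == t) for t in templates]
    return int(np.argmax(agreements))

# Tiny illustrative 3x3 "characters" (hypothetical data)
template_A = np.array([[0, 1, 0],
                       [1, 1, 1],
                       [1, 0, 1]])
template_B = np.array([[1, 1, 1],
                       [1, 0, 0],
                       [1, 1, 1]])
noisy = np.array([[0, 1, 0],
                  [1, 1, 1],
                  [1, 1, 1]])  # a noisy version of "A"

print(match_by_agreement(noisy, [template_A, template_B]))  # -> 0
```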
Template Matching

[Figure: comparing a pattern f against a template g; one comparison measure is marked as the most popular.]

Question: How can we achieve rotation invariance?
Minimum Distance Classifiers

Template matching can be expressed mathematically through a notion of distance.

Let x be the feature vector for the unknown input, and let m1, m2, ..., mc be templates (i.e., perfect, noise-free feature vectors) for the c classes.

The error in matching x against mk is given by || x - mk ||.

Choose the class for which the error is a minimum.

Since || x - mk || is the distance from x to mk, the technique is called minimum-distance classification.
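A minimal sketch (my illustration) of the resulting minimum-distance classifier, assuming the class templates m1, ..., mc are stored as the rows of a NumPy array:

```python
import numpy as np

def min_distance_classify(x, templates):
    """Assign x to the class whose template m_k minimizes ||x - m_k||."""
    distances = np.linalg.norm(templates - x, axis=1)
    return int(np.argmin(distances))

# Hypothetical templates for c = 3 classes in a 2-D feature space
templates = np.array([[4.3, 1.3],
                      [1.5, 0.3],
                      [3.0, 3.0]])
x = np.array([2.0, 0.5])
print(min_distance_classify(x, templates))  # index of the nearest template
```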
Minimum Distance Classifiers

[Block diagram: the input x is compared against each template m1, m2, ..., mc by a distance unit, and a minimum selector outputs the class of the nearest template.]

Writing a = x - mk = [a1, ..., ad]', two common distance measures are

$$\|\mathbf{a}\| = \left(|a_1|^2 + |a_2|^2 + \cdots + |a_d|^2\right)^{1/2} \quad\text{(Euclidean distance)}$$

$$|a_1| + |a_2| + \cdots + |a_d| \quad\text{(sum of absolute values)}$$
Euclidean Distance

x is a column vector of d features, x1, x2, ..., xd. By using the transpose operator ' we can convert the column vector x to the row vector x':

x' = [x1, x2, ..., xd]

The inner product of two column vectors x and y is defined by

$$\mathbf{x}'\mathbf{y} = x_1 y_1 + x_2 y_2 + \cdots + x_d y_d = \sum_{k=1}^{d} x_k y_k$$

Thus the norm of x (using the Euclidean metric) is given by

|| x || = sqrt( x'x )
Inner Products

Important additional properties of inner products:

x'y = y'x = || x || || y || cos( angle between x and y )

x'( y + z ) = x'y + x'z

The inner product of x and y is maximum when the angle between them is zero, i.e., when one is just a positive multiple of the other. Sometimes we say that x'y is the correlation between x and y, and that the correlation is maximum when x and y point in the same direction.

If x'y = 0, the vectors x and y are said to be orthogonal or uncorrelated.
Minimum Distance Classifiers

Example: Let m1 = [4.3 1.3]' and m2 = [1.5 0.3]'. Find the decision boundary.
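A worked sketch of the solution (my derivation, not from the slides): the decision boundary is the set of points equidistant from the two templates, i.e., the perpendicular bisector of the segment joining m1 and m2.

$$\|\mathbf{x}-\mathbf{m}_1\|^2 = \|\mathbf{x}-\mathbf{m}_2\|^2
\;\;\Longrightarrow\;\;
2\,(\mathbf{m}_1-\mathbf{m}_2)'\,\mathbf{x} = \|\mathbf{m}_1\|^2 - \|\mathbf{m}_2\|^2$$

With the given templates, m1 - m2 = [2.8 1.0]', ||m1||² = 20.18, and ||m2||² = 2.34, so the boundary is the line

$$5.6\,x_1 + 2\,x_2 = 17.84,$$

which passes through the midpoint (2.9, 0.8) of the two templates.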
Linear Discriminants

For the minimum-distance classifier, we chose the nearest class. Use the inner product to express the Euclidean distance from x to mk:

$$\|\mathbf{x}-\mathbf{m}_k\|^2 = (\mathbf{x}-\mathbf{m}_k)'(\mathbf{x}-\mathbf{m}_k)
= \mathbf{x}'\mathbf{x} - \mathbf{m}_k'\mathbf{x} - \mathbf{x}'\mathbf{m}_k + \mathbf{m}_k'\mathbf{m}_k
= -2\left[\mathbf{m}_k'\mathbf{x} - 0.5\,\mathbf{m}_k'\mathbf{m}_k\right] + \mathbf{x}'\mathbf{x}$$

(here 0.5 mk'mk is a constant for each class, and x'x is the same for every class).

To find the template mk which minimizes ||x - mk||, it is sufficient to find the mk which maximizes the bracketed term above. Define the linear discriminant function g(x) as

$$g_k(\mathbf{x}) = \mathbf{m}_k'\mathbf{x} - 0.5\,\|\mathbf{m}_k\|^2$$
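A minimal sketch (my illustration) of classification with these linear discriminants, assuming the templates are the rows of a matrix M:

```python
import numpy as np

def linear_discriminant_classify(x, M):
    """Evaluate g_k(x) = m_k' x - 0.5 ||m_k||^2 for every template m_k
    (rows of M) and return the index of the maximum discriminant."""
    g = M @ x - 0.5 * np.sum(M**2, axis=1)
    return int(np.argmax(g))

M = np.array([[4.3, 1.3],
              [1.5, 0.3]])        # templates from the example above
x = np.array([2.0, 1.0])
print(linear_discriminant_classify(x, M))  # same answer as minimum distance
```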
Min Euclidean Distance Classifier

A minimum-Euclidean-distance classifier classifies an input feature vector x by computing c linear discriminant functions

g1(x), g2(x), ..., gc(x)

and assigning x to the class corresponding to the maximum discriminant function.

[Block diagram: x is fed into the discriminant functions g1(x), g2(x), ..., gc(x) built from the templates; a maximum selector outputs the class.]
Feature Scaling

The numerical value for a feature x depends on the units used, i.e., on the scale.

If x is multiplied by a scale factor a, both the mean and the standard deviation are multiplied by a. The variance is multiplied by a².

Sometimes it is desirable to scale the data so that the resulting standard deviation is unity: divide x by the standard deviation s.

Similarly, in measuring the distance from x to m, it often makes sense to measure it relative to the standard deviation.
Feature Scaling

This suggests an important generalization of the minimum-Euclidean-distance classifier.

Let x(i) be the value of Feature i, let m(i,j) be the mean value of Feature i for Class j, and let s(i,j) be the standard deviation of Feature i for Class j.

In measuring the distance between the feature vector x and the mean vector mj for Class j, use the standardized distance

$$r(\mathbf{x},\mathbf{m}_j)^2 = \left(\frac{x_1 - m_{1j}}{s_{1j}}\right)^2 + \left(\frac{x_2 - m_{2j}}{s_{2j}}\right)^2 + \cdots + \left(\frac{x_d - m_{dj}}{s_{dj}}\right)^2$$
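A minimal sketch (my illustration) of this standardized distance, assuming the per-class feature means and standard deviations are given as NumPy arrays:

```python
import numpy as np

def standardized_distance_sq(x, mean_j, std_j):
    """r(x, m_j)^2 with each feature deviation scaled by that feature's
    standard deviation for class j."""
    return float(np.sum(((x - mean_j) / std_j) ** 2))

x      = np.array([2.0, 10.0])
mean_j = np.array([1.0,  8.0])   # hypothetical class-j feature means
std_j  = np.array([0.5,  4.0])   # hypothetical class-j standard deviations
print(standardized_distance_sq(x, mean_j, std_j))  # (2)^2 + (0.5)^2 = 4.25
```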
Covariance

The covariance of two features measures their tendency to vary together, i.e., to co-vary.

The variance is the average of the squared deviations of a feature from its mean; the covariance is the average of the products of the deviations of feature values from their means.

Consider Feature i and Feature j. Let { x(1,i), x(2,i), ..., x(n,i) } be a set of n examples of Feature i, and let { x(1,j), x(2,j), ..., x(n,j) } be the corresponding set of n examples of Feature j.
Variance

Let m(i) be the mean of Feature i. Then the variance of Feature i is

$$s(i)^2 = \frac{[x(1,i)-m(i)][x(1,i)-m(i)] + \cdots + [x(n,i)-m(i)][x(n,i)-m(i)]}{n-1}$$

s(i) is the standard deviation of Feature i.
Covariance

Let m(i) be the mean of Feature i, and m(j) be the mean of Feature j. Then the covariance of Feature i and Feature j is defined by

$$c(i,j) = \frac{[x(1,i)-m(i)][x(1,j)-m(j)] + \cdots + [x(n,i)-m(i)][x(n,j)-m(j)]}{n-1}$$

The covariance has several important properties:

If Feature i and Feature j tend to increase together, then c(i,j) > 0.
If Feature i tends to decrease when Feature j increases, then c(i,j) < 0.
If Feature i and Feature j are independent, then c(i,j) = 0.
| c(i,j) | <= s(i) s(j), where s(i) is the standard deviation of Feature i.
c(i,i) = s(i)², the variance of Feature i.
Covariance Matrix

All of the covariances c(i,j) can be collected together into a covariance matrix C:

$$C = \begin{bmatrix} c(1,1) & c(1,2) & \cdots & c(1,d) \\ c(2,1) & c(2,2) & \cdots & c(2,d) \\ \vdots & \vdots & & \vdots \\ c(d,1) & c(d,2) & \cdots & c(d,d) \end{bmatrix}$$
Covariance Matrix

Need to normalize the distance. Recall what we did earlier to get a standardized distance for a single feature:

$$r^2 = \left(\frac{x-m}{s}\right)^2 = (x-m)\,\frac{1}{s^2}\,(x-m)$$

What is the matrix generalization of this scalar equation?

$$r^2 = (\mathbf{x}-\mathbf{m}_x)^T C_x^{-1}(\mathbf{x}-\mathbf{m}_x)$$
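A minimal sketch (my illustration) of this covariance-normalized (Mahalanobis) distance, assuming the covariance matrix is estimated from sample data and is invertible:

```python
import numpy as np

def normalized_distance_sq(x, mean, cov):
    """r^2 = (x - m)^T C^{-1} (x - m), the covariance-normalized
    (Mahalanobis) squared distance."""
    diff = x - mean
    return float(diff @ np.linalg.solve(cov, diff))

# Hypothetical 2-D data for one class
data = np.array([[1.0, 2.0], [1.2, 2.6], [0.8, 1.9], [1.5, 2.8], [1.1, 2.2]])
mean = data.mean(axis=0)
cov = np.cov(data, rowvar=False)     # unbiased estimate, divides by n-1
print(normalized_distance_sq(np.array([1.3, 2.4]), mean, cov))
```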
Bayesian Decision Theory

Return to the fish example. There are two categories; denote these categories as w1 for sea bass and w2 for salmon.

Assume that there is some prior probability (or simply prior) P(w1) that the next fish is sea bass, and some prior probability P(w2) that it is salmon.

Suppose that we make a decision without making a measurement. The logical decision rule is:

Decide w1 if P(w1) > P(w2); otherwise decide w2.
Bayesian Decision Theory

Suppose that we have a feature vector x; now the decision rule is:

Decide w1 if P(w1 | x) > P(w2 | x); otherwise decide w2.

Using the Bayes formula,

$$P(w_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid w_i)\,P(w_i)}{p(\mathbf{x})},
\qquad\text{where}\qquad
p(\mathbf{x}) = \sum_i p(\mathbf{x} \mid w_i)\,P(w_i).$$
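A minimal sketch (my illustration) of this decision rule for two classes, assuming the class-conditional densities are available as Python functions (the Laplacian-shaped densities below are hypothetical placeholders):

```python
import numpy as np

def posterior(x, likelihoods, priors):
    """P(w_i | x) = p(x | w_i) P(w_i) / p(x), where p(x) = sum_i p(x | w_i) P(w_i)."""
    joint = np.array([p(x) for p in likelihoods]) * np.array(priors)
    return joint / joint.sum()

# Hypothetical class-conditional densities (two Laplacian-shaped curves)
likelihoods = [lambda x: 0.5 * np.exp(-abs(x - 0.0)),
               lambda x: 0.25 * np.exp(-abs(x - 3.0) / 2.0)]
priors = [0.5, 0.5]

post = posterior(1.0, likelihoods, priors)
decision = int(np.argmax(post))   # decide w1 if P(w1|x) > P(w2|x), else w2
print(post, decision)
```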
Bayesian Decision Theory

Define a set of discriminant functions gi(x), i = 1, ..., c. Since

$$P(w_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid w_i)\,P(w_i)}{p(\mathbf{x})},$$

we can use

$$g_i(\mathbf{x}) = p(\mathbf{x} \mid w_i)\,P(w_i)
\qquad\text{or}\qquad
g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid w_i) + \ln P(w_i).$$
Gaussian Density

Univariate:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$$

Multivariate:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}}\exp\!\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]$$
Example

Suppose there are two classes, w1 and w2, and the classification decision is made based on a feature measurement x.

The conditional densities are Gaussian distributions, N(mean, variance):
p(x|w1) ~ N(1, 1)
p(x|w2) ~ N(5, 4)

The prior probabilities are P(w1) = 0.2 and P(w2) = 0.8.

(a) What is the class of an object if its feature is x = 2?
(b) Find the decision boundary when P(w1) = P(w2) = 0.5.
(c) Find the decision boundary when P(w1) = 0.2 and P(w2) = 0.8.
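A minimal numerical check (my code, not part of the original exercise) that answers part (a) and locates the decision boundaries for (b) and (c) by scanning for sign changes of g1(x) - g2(x):

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def g_diff(x, priors):
    """g1(x) - g2(x) with g_i(x) = ln p(x|w_i) + ln P(w_i)."""
    return (np.log(gauss(x, 1.0, 1.0)) + np.log(priors[0])
            - np.log(gauss(x, 5.0, 4.0)) - np.log(priors[1]))

# (a): decide w1 if g1(2) > g2(2), using the given priors 0.2 / 0.8
print("x = 2 ->", "w1" if g_diff(2.0, (0.2, 0.8)) > 0 else "w2")

# (b), (c): scan a grid and report where the sign of g1 - g2 changes
xs = np.linspace(-5.0, 10.0, 150001)
for priors in [(0.5, 0.5), (0.2, 0.8)]:
    d = g_diff(xs, priors)
    crossings = xs[np.where(np.diff(np.sign(d)) != 0)]
    print("priors", priors, "boundary near x =", np.round(crossings, 3))
```

Because the two variances differ, g1(x) - g2(x) is quadratic in x, so the scan can report two crossing points.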
Gaussian Density

p(x) ~ N(μ, Σ)

The center of the cluster is determined by the mean vector, and the shape of the cluster is determined by the covariance matrix.

$$r^2 = (\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})$$

is the "Mahalanobis distance" from x to the mean.
Discriminant Functions for Gaussian

Let us examine the discriminant function for p(x | wi) ~ N(μi, Σi):

$$g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid w_i) + \ln P(w_i)$$

$$g_i(\mathbf{x}) = \ln\!\left[\frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}_i|^{1/2}}\exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\right)\right] + \ln P(w_i)$$

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(w_i)$$
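A minimal sketch (my illustration) of this discriminant, assuming each class is described by a mean vector, a covariance matrix, and a prior:

```python
import numpy as np

def gaussian_discriminant(x, mu, cov, prior):
    """g_i(x) = -0.5 (x-mu)' C^{-1} (x-mu) - 0.5 d ln(2 pi)
                - 0.5 ln|C| + ln P(w_i)"""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(cov, diff)
    return (-0.5 * quad - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior))

# Hypothetical 2-D classes
mu1, cov1, P1 = np.array([0.0, 0.0]), np.eye(2), 0.5
mu2, cov2, P2 = np.array([3.0, 3.0]), 2.0 * np.eye(2), 0.5
x = np.array([1.0, 1.0])
g1 = gaussian_discriminant(x, mu1, cov1, P1)
g2 = gaussian_discriminant(x, mu2, cov2, P2)
print("decide w1" if g1 > g2 else "decide w2")
```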
Discriminant Functions for Gaussian

Case I: Σi = σ²I. Then

$$\boldsymbol{\Sigma}_i^{-1} = \frac{1}{\sigma^2}\,\mathbf{I},
\qquad
g_i(\mathbf{x}) = -\frac{1}{2\sigma^2}\,(\mathbf{x}-\boldsymbol{\mu}_i)^T(\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(w_i)$$

[Figure: decision regions for Case I under different priors.]

As the priors change, the decision boundaries shift.
Discriminant Functions for Gaussian

Examples: Find the decision boundaries for 1D and 2D Gaussian data.

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(w_i)$$

Solve for x from g1(x) = g2(x).
Parameter Estimation

We learned how we could design an optimal classifier if we knew the prior probabilities P(wi) and the class-conditional densities p(x|wi).

In a typical application, we rarely have such complete knowledge. We typically have some general knowledge and a number of design samples (or training data).

We use the samples to estimate the unknown probabilities and probability densities, and then use these estimates as if they were the true values.

If the densities can be parameterized, the problem is simplified significantly. (For example, for a Gaussian distribution, the mean and covariance matrix are the only parameters we need to estimate.)
Parameter Estimation

Gaussian case:

$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k,
\qquad
\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^T$$
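A minimal sketch (my illustration) of these estimates, keeping the 1/n normalization shown on the slide (the maximum-likelihood estimate) rather than the 1/(n-1) used earlier for the sample covariance:

```python
import numpy as np

def estimate_gaussian(X):
    """Maximum-likelihood estimates of the mean vector and covariance
    matrix from the n x d data matrix X (one sample per row)."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)
    diff = X - mu_hat
    sigma_hat = (diff.T @ diff) / n          # note: 1/n, not 1/(n-1)
    return mu_hat, sigma_hat

# Hypothetical training samples for one class
X = np.array([[1.0, 2.0], [1.4, 2.3], [0.9, 1.8], [1.2, 2.6]])
mu_hat, sigma_hat = estimate_gaussian(X)
print(mu_hat)
print(sigma_hat)
```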
Dimensionality

The accuracy degrades when the dimensionality is large. The dimensionality can be reduced by combining features.

Linear combinations are attractive because they are simple to compute and analytically tractable.

Dimensionality reduction techniques include:

Principal Component Analysis
Fisher's Discriminant Analysis
Principal Component Analysis (PCA)

Find a lower-dimensional space that best represents the data in a least-squares sense.

[Figure (U. of Delaware): projection from the full N-dimensional space (here N = 2) onto a d-dimensional subspace (here d = 1).]
Principal Component Analysis (PCA)

We begin by considering the problem of representing the N-dimensional vectors x1, x2, ..., xn by a single vector x0. To be more specific, suppose that we want to find a vector x0 such that the sum of squared differences between x0 and the xk is as small as possible.

Define the cost function to be minimized:

$$J_0(\mathbf{x}_0) = \sum_{k=1}^{n}\|\mathbf{x}_0-\mathbf{x}_k\|^2$$

The solution is the sample mean:

$$\mathbf{x}_0 = \boldsymbol{\mu} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$$
Principal Component Analysis (PCA)

The sample mean does not reveal any of the variability in the data. Let us now consider a solution of the form

$$\mathbf{x}_k \approx \boldsymbol{\mu} + a_k\,\mathbf{e}$$

where ak is a scalar and e is a unit vector.

Define the cost function to be minimized:

$$J_1(a_1,\ldots,a_n,\mathbf{e}) = \sum_{k=1}^{n}\|(\boldsymbol{\mu}+a_k\mathbf{e})-\mathbf{x}_k\|^2$$

The solution is

$$a_k = \mathbf{e}^T(\mathbf{x}_k-\boldsymbol{\mu})$$
Principal Component Analysis (PCA)

What is the best direction e for the line?

$$J_1(a_1,\ldots,a_n,\mathbf{e}) = \sum_{k=1}^{n}\|(\boldsymbol{\mu}+a_k\mathbf{e})-\mathbf{x}_k\|^2$$

Using a_k = e^T(x_k - μ), we get

$$J_1(\mathbf{e}) = -\mathbf{e}^T S\,\mathbf{e} + \sum_{k=1}^{n}\|\boldsymbol{\mu}-\mathbf{x}_k\|^2,
\qquad\text{where}\qquad
S = \sum_{k=1}^{n}(\mathbf{x}_k-\boldsymbol{\mu})(\mathbf{x}_k-\boldsymbol{\mu})^T$$

Find the e that maximizes e^T S e subject to e^T e = 1, i.e., maximize e^T S e - λ(e^T e - 1).
Principal Component Analysis (PCA)

The solution is

$$S\,\mathbf{e} = \lambda\,\mathbf{e},
\qquad\text{where}\qquad
S = \sum_{k=1}^{n}(\mathbf{x}_k-\boldsymbol{\mu})(\mathbf{x}_k-\boldsymbol{\mu})^T$$

Since e^T S e = λ e^T e = λ, we select the eigenvector corresponding to the largest eigenvalue.
Principal Component Analysis (PCA)

Generalize this to d dimensions (d <= n): find the eigenvectors e1, e2, ..., ed corresponding to the d largest eigenvalues of S. Then

$$a_i = \mathbf{e}_i^T(\mathbf{x}_k-\boldsymbol{\mu}), \quad i = 1,\ldots,d,
\qquad\qquad
\mathbf{x}_k \approx \boldsymbol{\mu} + \sum_{i=1}^{d}a_i\,\mathbf{e}_i$$
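A minimal sketch (my illustration) of PCA via the eigendecomposition of the scatter matrix S, assuming the data samples are the rows of a matrix X:

```python
import numpy as np

def pca(X, d):
    """Return the mean, the top-d eigenvectors of the scatter matrix
    (as columns), and the d-dimensional coefficients of each sample."""
    mu = X.mean(axis=0)
    diff = X - mu
    S = diff.T @ diff                      # scatter matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    E = eigvecs[:, ::-1][:, :d]            # top-d eigenvectors as columns
    A = diff @ E                           # a_i = e_i' (x_k - mu)
    return mu, E, A

# Hypothetical 3-D data reduced to d = 1
X = np.array([[2.0, 1.0, 0.5], [3.1, 1.9, 1.1], [4.2, 3.1, 1.4], [5.0, 3.9, 2.1]])
mu, E, A = pca(X, d=1)
X_approx = mu + A @ E.T                    # x_k ~ mu + sum_i a_i e_i
print(np.round(X_approx - X, 3))           # reconstruction error
```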
Face Recognition

[Figure: a probe face image to be identified against a set of known faces.]
Eigenface Approach

Reduce the dimensionality by applying PCA:

Apply PCA to a training dataset to find the first d principal components (here d = 8).
Find the weights a1, ..., a8 for all images.
Classify the probe using the norm distance (see the sketch below).
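A minimal sketch (my illustration) of the eigenface pipeline, following the same PCA steps as above and assuming the face images are flattened into the rows of a training matrix (all data below are hypothetical):

```python
import numpy as np

def eigenface_classify(probe, train_images, train_labels, d=8):
    """Project the probe onto the first d principal components of the
    training faces and return the label of the nearest training face
    in that d-dimensional weight space (norm distance)."""
    mu = train_images.mean(axis=0)
    diff = train_images - mu
    S = diff.T @ diff
    _, eigvecs = np.linalg.eigh(S)
    E = eigvecs[:, ::-1][:, :d]            # first d eigenfaces as columns
    train_weights = diff @ E               # a_1 ... a_d for every image
    probe_weights = (probe - mu) @ E
    dists = np.linalg.norm(train_weights - probe_weights, axis=1)
    return train_labels[int(np.argmin(dists))]

# Hypothetical tiny example: 20 "images" of 64 pixels, 4 identities
rng = np.random.default_rng(0)
train_images = rng.random((20, 64))
train_labels = np.repeat(np.arange(4), 5)
probe = train_images[7] + 0.01 * rng.random(64)   # noisy copy of image 7
print(eigenface_classify(probe, train_images, train_labels))  # expected: label of image 7
```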