Bayes Decision Theory
Minimum-Error-Rate Classification
Classifiers, Discriminant Functions and Decision Surfaces
The Normal Density
Minimum-Error-Rate Classification
• Actions are decisions on classes
If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j.
• Seek a decision rule that minimizes the probability of error, which is the error rate.
Minimum Error Rate Classifier Derivation
• Zero-one loss function:

λ(αi, ωj) = { 0 if i = j; 1 if i ≠ j },   i, j = 1, ..., c

• Therefore, the conditional risk is:

R(αi | x) = ∑_{j=1}^{c} λ(αi | ωj) P(ωj | x) = ∑_{j≠i} P(ωj | x) = 1 − P(ωi | x)
• The risk corresponding to this loss function is the average probability of error.
• Minimizing the risk requires maximizing P(ωi | x), since R(αi | x) = 1 − P(ωi | x).
For minimum error rate:
Decide ωi if P(ωi | x) > P(ωj | x) ∀ j ≠ i
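A minimal sketch of this rule in Python (not from the slides; the prior and likelihood numbers are assumed, purely for illustration):

```python
import numpy as np

# Minimum-error-rate rule: decide the class with the largest posterior.
# Priors and class-conditional likelihood values are illustrative only.
priors = np.array([0.6, 0.4])        # P(w1), P(w2)
likelihoods = np.array([0.3, 0.7])   # p(x | w1), p(x | w2) at some observed x

posteriors = likelihoods * priors
posteriors /= posteriors.sum()       # normalize by the evidence p(x)

i = np.argmax(posteriors)            # decide w_i that maximizes P(w_i | x)
print(f"decide class {i + 1}, posterior = {posteriors[i]:.3f}")
```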
Likelihood Ratio Classification
• Regions of decision and the zero-one loss function:

Let θλ = [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]; then decide ω1 if:

p(x | ω1) / p(x | ω2) > θλ

• If λ is the zero-one loss function, i.e. λ = [0 1; 1 0], then θλ = P(ω2)/P(ω1) = θa

• If λ = [0 2; 1 0], then θλ = 2 P(ω2)/P(ω1) = θb
[Figure: class-conditional pdfs p(x | ω1) and p(x | ω2) with decision thresholds θa and θb]
• Likelihood ratio p(x | ω1)/p(x | ω2).
• If we use a zero-one loss function, the decision boundaries are determined by the threshold θa.
• If the loss function penalizes miscategorizing ω2 as ω1 more than the converse, we get the larger threshold θb, and hence the region R1 becomes smaller.
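A short sketch of the likelihood-ratio rule with an assumed loss matrix and equal priors; it reproduces θb = 2 P(ω2)/P(ω1) for the second loss matrix above:

```python
import numpy as np

# Likelihood-ratio test for two classes under a general loss matrix
# lam[i][j] = lambda(alpha_i | w_j). Values are illustrative.
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])   # penalize w2-as-w1 errors twice as much
P1, P2 = 0.5, 0.5              # priors P(w1), P(w2)

theta = (lam[0, 1] - lam[1, 1]) / (lam[1, 0] - lam[0, 0]) * P2 / P1

def decide(px_w1, px_w2):
    """Decide w1 iff the likelihood ratio exceeds the threshold theta."""
    return "w1" if px_w1 / px_w2 > theta else "w2"

print(theta)             # 2.0 = theta_b for this loss matrix and priors
print(decide(0.9, 0.3))  # ratio 3.0 > 2.0 -> w1
```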
Classifiers, Discriminant Functions and Decision Surfaces
• Many methods of representing pattern classifiers
Set of discriminant functions gi(x), i = 1, ..., c
The classifier assigns a feature vector x to class ωi if gi(x) > gj(x) ∀ j ≠ i
A classifier is a machine that computes c discriminant functions.
[Figure: functional structure of a general statistical pattern classifier with d inputs and c discriminant functions gi(x)]
Forms of Discriminant Functions
• Let gi(x) = −R(αi | x)
(max. discriminant corresponds to min. risk!)
• For the minimum error rate, we take
gi(x) = P(ωi | x)
(max. discriminant corresponds to max. posterior!)
• Equivalent choices (monotonically increasing transformations do not change the decision):
gi(x) = P(x | ωi) P(ωi)
gi(x) = ln P(x | ωi) + ln P(ωi)
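A brief sketch of the logarithmic form for univariate Gaussian class conditionals; the means, standard deviations, and priors are assumed values, not from the slides:

```python
import numpy as np

# Log discriminants g_i(x) = ln p(x | w_i) + ln P(w_i) for univariate
# Gaussian class conditionals. Parameter values are illustrative.
means  = np.array([0.0, 2.0])
sigmas = np.array([1.0, 1.0])
priors = np.array([0.7, 0.3])

def g(x):
    # log of the univariate normal density, plus the log prior
    log_lik = (-0.5 * ((x - means) / sigmas) ** 2
               - np.log(np.sqrt(2 * np.pi) * sigmas))
    return log_lik + np.log(priors)

x = 1.2
print("decide class", np.argmax(g(x)) + 1)
```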
Decision Region
• Feature space is divided into c decision regions:
if gi(x) > gj(x) ∀ j ≠ i, then x is in Ri
[Figure: a 2-D, two-category classifier with Gaussian pdfs. The decision boundary consists of two hyperbolas, hence decision region R2 is not simply connected. Ellipses mark where the density is 1/e times that of the peak of the distribution.]
The Two-Category case
A classifier with exactly two discriminant functions g1 and g2 is a dichotomizer.
Let g(x) ≡ g1(x) − g2(x).
Decide ω1 if g(x) > 0; otherwise decide ω2.
The computation of g(x):

g(x) = P(ω1 | x) − P(ω2 | x)

or, equivalently for the decision,

g(x) = ln [P(x | ω1) / P(x | ω2)] + ln [P(ω1) / P(ω2)]
The Normal Distribution
A bell-shaped distribution defined by the probability density function
p(x) = (1 / √(2πσ²)) exp[ −(1/2) ((x − µ)/σ)² ]
If the random variable X follows a normal distribution, then
• The probability that X will fall into the interval (a, b) is given by

∫_a^b p(x) dx
• The expected, or mean, value of X is

E[X] = ∫_{−∞}^{∞} x p(x) dx = µ

• The variance of X is

Var(X) = E[(X − µ)²] = ∫_{−∞}^{∞} (x − µ)² p(x) dx = σ²

• The standard deviation of X is the square root of the variance: σx = σ
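As a numerical sanity check (a sketch, not part of the original slides), the pdf can be integrated on a grid to confirm unit mass, E[X] = µ, and Var(X) = σ²; the values of µ and σ below are assumed:

```python
import numpy as np

# Integrate the normal pdf numerically to verify total mass 1,
# E[X] = mu, and Var(X) = sigma^2.
mu, sigma = 1.5, 2.0
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
p = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

mass = np.trapz(p, x)                    # ~1.0
mean = np.trapz(x * p, x)                # ~mu
var  = np.trapz((x - mean) ** 2 * p, x)  # ~sigma^2
print(mass, mean, var)
```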
Relationship between Entropy and Normal Density
The entropy of a distribution is

H(p(x)) = −∫_{−∞}^{∞} p(x) ln p(x) dx

measured in nats; if log₂ is used, the unit is bits.
Entropy measures the uncertainty in the values of points selected randomly from a distribution.
The normal distribution has maximum entropy over all distributions having a given mean and variance.
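For the normal density this integral has the closed form H = ½ ln(2πeσ²) nats; the sketch below (with an assumed σ) compares the closed form against a numerical estimate of −∫ p(x) ln p(x) dx:

```python
import numpy as np

# Closed-form entropy of a normal density vs. a numerical estimate.
sigma = 2.0
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

x = np.linspace(-10 * sigma, 10 * sigma, 200_001)
p = np.exp(-0.5 * (x / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)
numeric = -np.trapz(p * np.log(p), x)

print(closed_form, numeric)  # both ~2.112 nats for sigma = 2
```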
Normal Distribution, Mean 0, Standard Deviation 1
With 80% confidence the r.v. will lie in the two-sided interval [−1.28, 1.28].
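A quick check of this interval (a sketch using scipy.stats):

```python
from scipy.stats import norm

# The two-sided 80% interval for a standard normal leaves 10% in each tail.
z = norm.ppf(0.9)                  # upper 90% point, ~1.2816
print(z)                           # so [-z, z] is the two-sided interval
print(norm.cdf(z) - norm.cdf(-z))  # coverage ~0.80
```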
The Normal Density in Pattern Recognition
• Univariate density
• Analytically tractable, continuous
• A lot of processes are asymptotically Gaussian
• Central Limit Theorem: the aggregate effect of a sum of a large number of small, independent random disturbances leads to a Gaussian distribution
• Handwritten characters and speech sounds are ideal or prototype patterns corrupted by a random process

p(x) = (1 / (√(2π) σ)) exp[ −(1/2) ((x − µ)/σ)² ]

where:
µ = mean (or expected value) of x
σ² = expected squared deviation, or variance

The univariate normal distribution has roughly 95% of its area in the range |x − µ| < 2σ. The peak of the distribution has value p(µ) = 1/√(2πσ²).
Multivariate density
Multivariate normal density in d dimensions is:
p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[ −(1/2) (x − µ)ᵗ Σ⁻¹ (x − µ) ]

abbreviated as p(x) ~ N(µ, Σ)
where:
x = (x1, x2, …, xd)ᵗ (t denotes the transpose)
µ = (µ1, µ2, …, µd)ᵗ is the mean vector
Σ is the d × d covariance matrix
|Σ| and Σ⁻¹ are its determinant and inverse, respectively
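A sketch evaluating this density with numpy; the values of µ and Σ below are assumed, purely for illustration:

```python
import numpy as np

# d-dimensional normal density, here for d = 2 with illustrative parameters.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

def mvn_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.inv(Sigma) @ diff  # (x-mu)^t Sigma^-1 (x-mu)
    return np.exp(-0.5 * quad) / norm_const

print(mvn_pdf(np.array([0.5, 0.5]), mu, Sigma))
```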
Mean and Covariance Matrix
• Formal Definitions
µ = E[x] = ∫ x p(x) dx

Σ = E[(x − µ)(x − µ)ᵗ] = ∫ (x − µ)(x − µ)ᵗ p(x) dx
The components of the mean vector are the means of the individual variables.
Covariance matrix:
• Diagonal elements are the variances of the variables
• Off-diagonal elements are the covariances of pairs of variables
• Statistical independence means the off-diagonal elements are zero
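A sketch (with assumed µ and Σ) estimating both quantities from samples and inspecting the diagonal and off-diagonal entries:

```python
import numpy as np

# Estimate the mean vector and covariance matrix from samples drawn from
# an illustrative 2-D Gaussian.
rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=10_000)  # rows are samples

mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False)
print(mu_hat)     # ~mu
print(Sigma_hat)  # diagonal ~variances, off-diagonal ~covariances
```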
Multivariate Normal Density
• Specified by d + d(d+1)/2 parameters: the mean and the independent elements of the covariance matrix
• Loci of points of constant density are hyperellipsoids
Samples drawn from a 2-D Gaussian lie in a cloud centered at the mean µ. Ellipses show lines of equal probability density of the Gaussian.
Linear combinations of normally distributed variables are normally distributed.
Whitening Transform
If Φ is the matrix whose columns are the orthonormal eigenvectors of Σ, and Λ is the diagonal matrix of the corresponding eigenvalues, then the transformation Aw = ΦΛ^(−1/2) applied to the coordinates ensures that the transformed distribution has a covariance matrix equal to the identity matrix.
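A sketch of this transform built from numpy's eigendecomposition; the Σ below is an assumed example:

```python
import numpy as np

# Whitening transform A_w = Phi Lambda^(-1/2) from the eigendecomposition
# of Sigma; Aw^t Sigma Aw should be the identity matrix.
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
eigvals, Phi = np.linalg.eigh(Sigma)  # eigenvalues (Lambda) and eigenvectors (Phi)
A_w = Phi @ np.diag(eigvals ** -0.5)  # A_w = Phi Lambda^(-1/2)

print(A_w.T @ Sigma @ A_w)            # ~identity matrix
```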
The action of a linear transformation on the feature space will convert an arbitrary normal distribution into another normal distribution. One transformation, A, takes the source distribution into the distribution N(Aᵗµ, AᵗΣA). Another linear transformation, a projection P onto a line defined by the vector a, leads to N(µ, σ²) measured along that line. While the transforms yield distributions in a different space, they are shown superimposed on the original x1-x2 space. A whitening transform Aw leads to a circularly symmetric Gaussian, here shown displaced.
Mahalanobis Distance

r² = (x − µ)ᵗ Σ⁻¹ (x − µ)

where r is the Mahalanobis distance from x to µ.

• Contours of constant density are hyperellipsoids of constant Mahalanobis distance.
• For a given dimensionality, the scatter of the samples varies directly with |Σ|^(1/2).
Samples drawn from a 2-D Gaussian lie in a cloud centered at the mean µ. Ellipses show lines of equal probability density of the Gaussian.
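A sketch computing r for an assumed µ and Σ:

```python
import numpy as np

# Squared Mahalanobis distance r^2 = (x - mu)^t Sigma^-1 (x - mu);
# mu and Sigma are illustrative values.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

x = np.array([1.0, 2.0])
diff = x - mu
r2 = diff @ np.linalg.solve(Sigma, diff)  # avoids forming Sigma^-1 explicitly
print(np.sqrt(r2))                        # Mahalanobis distance r
```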