Bayes Decision Theory

Minimum-Error-Rate Classification
Classifiers, Discriminant Functions and Decision Surfaces
The Normal Density

CSE 555: Srihari

Minimum-Error-Rate Classification
• Actions are decisions on classes.
• If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j.
• We seek a decision rule that minimizes the probability of error, which is the error rate.

Minimum-Error-Rate Classifier Derivation
• Zero-one loss function:
$$\lambda(\alpha_i, \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \ldots, c$$
• Therefore, the conditional risk is:
$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$
• The risk corresponding to this loss function is the average probability of error.
• Minimizing the risk requires maximizing P(ωi | x), since R(αi | x) = 1 − P(ωi | x).
• For minimum error rate: decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i.

Likelihood-Ratio Classification
• Decision regions under a general loss function: let
$$\theta_\lambda = \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$
Then decide ω1 if:
$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \theta_\lambda$$
• If λ is the zero-one loss function, i.e.
$$\lambda = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad \text{then} \qquad \theta_\lambda = \frac{P(\omega_2)}{P(\omega_1)} = \theta_a$$
• If instead
$$\lambda = \begin{pmatrix} 0 & 2 \\ 1 & 0 \end{pmatrix}, \qquad \text{then} \qquad \theta_\lambda = \frac{2\, P(\omega_2)}{P(\omega_1)} = \theta_b$$
[Figure: class-conditional pdfs and the likelihood ratio p(x | ω1)/p(x | ω2). With the zero-one loss function, the decision boundary is determined by the threshold θa. If the loss function penalizes miscategorizing ω2 as ω1 more than the converse, we get the larger threshold θb, and hence region R1 becomes smaller.]

Classifiers, Discriminant Functions and Decision Surfaces
• There are many ways of representing pattern classifiers; one is a set of discriminant functions gi(x), i = 1, …, c.
• The classifier assigns feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i.
• A classifier is a machine that computes c discriminant functions.
[Figure: functional structure of a general statistical pattern classifier with d inputs and c discriminant functions gi(x).]

Forms of Discriminant Functions
• Let gi(x) = −R(αi | x): the maximum discriminant corresponds to the minimum risk.
• For minimum error rate, take gi(x) = P(ωi | x): the maximum discriminant corresponds to the maximum posterior.
• Equivalent forms (monotonic transformations leave the decision unchanged):
$$g_i(x) = p(x \mid \omega_i)\, P(\omega_i) \qquad \text{or} \qquad g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$$

Decision Regions
• The feature space is divided into c decision regions: if gi(x) > gj(x) for all j ≠ i, then x is in Ri.
[Figure: a 2-D, two-category classifier with Gaussian pdfs. The decision boundary consists of two hyperbolas, so decision region R2 is not simply connected. Ellipses mark where the density is 1/e times that of the peak of each distribution.]

The Two-Category Case
• A two-category classifier is a dichotomizer: it has two discriminant functions g1 and g2.
• Let g(x) ≡ g1(x) − g2(x). Decide ω1 if g(x) > 0; otherwise decide ω2.
• Two equivalent computations of g(x) (they have the same sign, so they yield the same decisions):
$$g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x) \qquad \text{or} \qquad g(x) = \ln \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$

The Normal Distribution
A bell-shaped distribution defined by the probability density function:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
If the random variable X follows a normal distribution, then:
• The probability that X falls in the interval (a, b) is $\int_a^b p(x)\, dx$.
• The expected, or mean, value of X is $E[X] = \int_{-\infty}^{\infty} x\, p(x)\, dx = \mu$.
• The variance of X is $\mathrm{Var}(X) = E[(x-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2\, p(x)\, dx = \sigma^2$.
• The standard deviation of X, σx, equals σ, the square root of the variance.
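The dichotomizer g(x) and the normal density above combine naturally. The following is a minimal Python/NumPy sketch, not part of the slides: it implements the two-category minimum-error-rate rule with univariate Gaussian class-conditionals, using the log-likelihood-ratio form of g(x). The means, variances, and priors are made-up illustration values.

```python
# Minimal sketch (not from the slides): two-category minimum-error-rate rule
# with univariate Gaussian class-conditionals. Parameters are illustrative.
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density p(x) = exp(-0.5*((x-mu)/sigma)^2) / sqrt(2*pi*sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2.0 * np.pi * sigma ** 2)

def g(x, mu1, sigma1, prior1, mu2, sigma2, prior2):
    """Dichotomizer g(x) = ln p(x|w1)/p(x|w2) + ln P(w1)/P(w2); decide w1 if g(x) > 0."""
    return (np.log(gaussian_pdf(x, mu1, sigma1)) - np.log(gaussian_pdf(x, mu2, sigma2))
            + np.log(prior1) - np.log(prior2))

# Illustration: omega_1 ~ N(0, 1) with P(omega_1) = 0.6, omega_2 ~ N(2, 1) with P(omega_2) = 0.4.
for x in (-1.0, 1.0, 3.0):
    decision = 1 if g(x, 0.0, 1.0, 0.6, 2.0, 1.0, 0.4) > 0 else 2
    print(f"x = {x:+.1f} -> decide omega_{decision}")
```

Because the two classes here share the same variance, g(x) is linear in x and the decision boundary is a single threshold; unequal variances would make g(x) quadratic, as in the hyperbolic decision boundaries pictured earlier.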
Relationship between Entropy and the Normal Density
• The entropy of a distribution is
$$H(p(x)) = -\int_{-\infty}^{\infty} p(x) \ln p(x)\, dx$$
measured in nats; if log2 is used, the unit is bits.
• Entropy measures the uncertainty in the values of points selected randomly from a distribution.
• The normal distribution has maximum entropy over all distributions having a given mean and variance.

Normal Distribution, Mean 0, Standard Deviation 1
[Figure: the standard normal density.] With 80% confidence the random variable lies in the two-sided interval [−1.28, 1.28].

The Normal Density in Pattern Recognition
Univariate density:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$$
where µ is the mean (or expected value) of x and σ² is the expected squared deviation, or variance.
• Analytically tractable and continuous.
• Many processes are asymptotically Gaussian.
• Central Limit Theorem: the aggregate effect of a sum of a large number of small, independent random disturbances leads to a Gaussian distribution.
• Handwritten characters and speech sounds can be viewed as an ideal or prototype corrupted by a random process.
• The univariate normal distribution has roughly 95% of its area in the range |x − µ| < 2σ. The peak of the density has value $p(\mu) = 1/\sqrt{2\pi\sigma^2}$.

Multivariate Density
The multivariate normal density in d dimensions is:
$$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^t\, \Sigma^{-1}\, (x-\mu)\right]$$
abbreviated as p(x) ~ N(µ, Σ), where:
• x = (x1, x2, …, xd)t (t stands for the transpose);
• µ = (µ1, µ2, …, µd)t is the mean vector;
• Σ is the d × d covariance matrix; |Σ| and Σ⁻¹ are its determinant and inverse, respectively.

Mean and Covariance Matrix
• Formal definitions:
$$\mu = E[x] = \int x\, p(x)\, dx$$
$$\Sigma = E[(x-\mu)(x-\mu)^t] = \int (x-\mu)(x-\mu)^t\, p(x)\, dx$$
• The components of the mean vector are the means of the individual variables.
• The diagonal elements of the covariance matrix are the variances of the variables; the off-diagonal elements are the covariances of pairs of variables.
• Statistical independence means the off-diagonal elements are zero.

Multivariate Normal Density
• Specified by d + d(d+1)/2 parameters: the mean and the independent elements of the covariance matrix.
• Loci of points of constant density are hyperellipsoids.
• Linear combinations of normally distributed variables are normally distributed.
[Figure: samples drawn from a 2-D Gaussian lie in a cloud centered at the mean µ; ellipses show lines of equal probability density of the Gaussian.]

Whitening Transform
• If Φ is the matrix whose columns are the orthonormal eigenvectors of Σ and Λ is the diagonal matrix of corresponding eigenvalues, then the transformation
$$A_w = \Phi\, \Lambda^{-1/2}$$
applied to the coordinates ensures that the transformed distribution has a covariance matrix equal to the identity matrix.
• The action of a linear transformation on the feature space converts an arbitrary normal distribution into another normal distribution: a transformation A takes the source distribution into N(Atµ, AtΣA).
• Another linear transformation, a projection P onto a line defined by a vector a, leads to a univariate normal density N(µ, σ²) measured along that line.
[Figure: while the transforms yield distributions in a different space, they are shown superimposed on the original x1-x2 space. A whitening transform Aw leads to a circularly symmetric Gaussian, here shown displaced.]

Mahalanobis Distance
• The quantity
$$r^2 = (x-\mu)^t\, \Sigma^{-1}\, (x-\mu)$$
is the squared Mahalanobis distance from x to µ.
• Contours of constant density are hyperellipsoids of constant Mahalanobis distance.
• For a given dimensionality, the scatter of the samples varies directly with $|\Sigma|^{1/2}$.
[Figure: samples drawn from a 2-D Gaussian lie in a cloud centered at the mean µ; ellipses show lines of equal probability density of the Gaussian.]
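To make the last two slides concrete, here is a minimal Python/NumPy sketch, not part of the slides: it computes the squared Mahalanobis distance and builds the whitening transform Aw = ΦΛ^(−1/2) from the eigendecomposition of Σ, then checks that the transformed covariance is the identity. The mean vector and covariance matrix are made-up 2-D example values.

```python
# Minimal sketch (not from the slides): Mahalanobis distance and the whitening
# transform A_w = Phi @ Lambda^{-1/2}. The 2-D mean and covariance are illustrative.
import numpy as np

mu = np.array([1.0, 2.0])
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])            # symmetric positive-definite covariance

# Squared Mahalanobis distance r^2 = (x - mu)^t Sigma^{-1} (x - mu)
x = np.array([2.0, 3.0])
d = x - mu
r2 = d @ np.linalg.inv(sigma) @ d
print("squared Mahalanobis distance:", r2)

# Whitening: columns of phi are the orthonormal eigenvectors of sigma,
# eigvals are the corresponding eigenvalues (Lambda is their diagonal matrix).
eigvals, phi = np.linalg.eigh(sigma)
a_w = phi @ np.diag(eigvals ** -0.5)

# The transformed covariance A_w^t Sigma A_w should be the identity matrix.
print(np.round(a_w.T @ sigma @ a_w, 10))  # ~ [[1, 0], [0, 1]]
```

In the whitened coordinates the equal-density hyperellipsoids become circles (spheres in higher dimensions), which is exactly the circularly symmetric Gaussian described in the whitening-transform figure.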