Outline

• Review
• Maximum A-Posteriori (MAP) Estimation
• Bayesian Parameter Estimation
• Example: The Gaussian Case
• Recursive Bayesian Incremental Learning
• Problems of Dimensionality
• Linear Algebra Review
• Principal Component Analysis
• Fisher Discriminant

Bayesian Decision Theory

• Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification.
  – It addresses decision making when all the probabilistic information is known.
  – For given probabilities the decision is optimal.
  – When new information is added, it is assimilated in an optimal fashion to improve the decisions.

Bayes' Formula

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)}, \qquad p(x) = \sum_{j=1}^{2} p(x \mid \omega_j)\,P(\omega_j),$$

that is,

$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}.$$

• $p(x \mid \omega_j)$ is called the likelihood of $\omega_j$ with respect to $x$: the category $\omega_j$ for which $p(x \mid \omega_j)$ is large is more "likely" to be the true category.
• $p(x)$ is the evidence: how frequently we will measure a pattern with feature value $x$. It is a scale factor that guarantees that the posterior probabilities sum to 1.

Bayes' Decision Rule (Minimizes the Probability of Error)

Decide $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, otherwise decide $\omega_2$; equivalently,

decide $\omega_1$ if $p(x \mid \omega_1)\,P(\omega_1) > p(x \mid \omega_2)\,P(\omega_2)$, otherwise decide $\omega_2$.

The resulting error is $P(\text{error} \mid x) = \min\left[P(\omega_1 \mid x),\, P(\omega_2 \mid x)\right]$.

Normal Density – Univariate Case

• The Gaussian density with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 \in \mathbb{R}^{+}$ is

$$p(x) = \frac{1}{(2\pi)^{1/2}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right], \qquad p(x) \sim N(\mu, \sigma^2).$$

• It can be shown that

$$\mu = E[x] = \int_{-\infty}^{\infty} x\,p(x)\,dx, \qquad \sigma^2 = E[(x-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2\,p(x)\,dx.$$

Normal Density – Multivariate Case

• The general multivariate normal density (MND) in $d$ dimensions is written as

$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^{t}\,\Sigma^{-1}(x-\mu)\right].$$

• It can be shown that

$$\mu = E[x] = \int_{\mathbb{R}^d} x\,p(x)\,dx, \qquad \Sigma = E[(x-\mu)(x-\mu)^{t}],$$

which means for the components $\sigma_{ij} = E[(x_i-\mu_i)(x_j-\mu_j)]$.

• The covariance matrix $\Sigma$ is always symmetric and positive semidefinite.

Maximum Likelihood and Bayesian Parameter Estimation

• To design an optimal classifier we need $P(\omega_i)$ and $p(x \mid \omega_i)$, but usually we do not know them.
• Solution: use training data to estimate the unknown probabilities. Estimation of the class-conditional densities is a difficult task.
• Supervised learning: we get to see samples from each of the classes "separately" (called tagged or labeled samples).
• Labeled samples are "expensive", so we need to learn the distributions as efficiently as possible.
• Two methods: parametric (easier) and nonparametric (harder).
• Program for parametric methods (a minimal sketch follows this list):
  – Assume specific parametric distributions with parameters $\theta \in \Theta \subset \mathbb{R}^p$.
  – Estimate the parameters $\hat{\theta}(D)$ from the training data $D$.
  – Replace the true value of the class-conditional density with the approximation and apply the Bayesian framework for decision making.
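As an illustration of this program, here is a minimal sketch of a plug-in Gaussian classifier in Python; the simulated data, class parameters, and scipy usage are assumptions for the example, not part of the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative labeled training data (assumed, not from the lecture).
rng = np.random.default_rng(0)
D1 = rng.multivariate_normal([0, 0], np.eye(2), size=200)  # samples of class omega_1
D2 = rng.multivariate_normal([2, 2], np.eye(2), size=200)  # samples of class omega_2

# Assume Gaussian class-conditionals and estimate theta_hat(D) for each class.
mu1, Sigma1 = D1.mean(axis=0), np.cov(D1, rowvar=False)
mu2, Sigma2 = D2.mean(axis=0), np.cov(D2, rowvar=False)
P1 = len(D1) / (len(D1) + len(D2))  # prior P(omega_1) from sample proportions
P2 = 1.0 - P1

def decide(x):
    """Bayes decision rule with the plug-in densities p(x | omega_i, theta_hat)."""
    g1 = multivariate_normal(mu1, Sigma1).pdf(x) * P1
    g2 = multivariate_normal(mu2, Sigma2).pdf(x) * P2
    return 1 if g1 > g2 else 2

print(decide([0.5, 0.2]))  # expected: 1
print(decide([1.9, 2.3]))  # expected: 2
```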
Maximum Likelihood and Bayesian Parameter Estimation cont.

• Suppose we can assume that the relevant (class-conditional) densities are of some parametric form, that is, $p(x \mid \omega) = p(x \mid \theta)$, where $\theta \in \Theta \subset \mathbb{R}^p$.
• Examples of parameterized densities:
  – Binomial: $x^{(n)}$ has $m$ 1's and $n-m$ 0's,
    $$p(x^{(n)} \mid \theta) = \binom{n}{m}\theta^{m}(1-\theta)^{n-m}, \qquad \Theta = [0,1].$$
  – Exponential: each data point $x$ is distributed according to
    $$p(x \mid \theta) = \theta e^{-\theta x}, \qquad \Theta = (0, \infty).$$
• Two procedures for parameter estimation will be considered:
  – Maximum likelihood estimation: choose the parameter value $\hat{\theta}$ that makes the data most probable (i.e., maximizes the probability of obtaining the sample that has actually been observed),
    $$p(x \mid D) = p(x \mid \hat{\theta}(D)), \qquad \hat{\theta}(D) = \arg\max_{\theta}\, p(D \mid \theta).$$
  – Bayesian learning: define a prior probability $p(\theta)$ on the model space and compute the posterior $p(\theta \mid D)$. Additional samples sharpen the posterior density, which peaks near the true values of the parameters.

Sampling Model

• It is assumed that a sample set $S = \{(x_l, \omega_l) : l = 1, \ldots, N\}$ with independently generated samples is available.
• The sample set is partitioned into separate sample sets for each class, $D_j = \{x_l : (x_l, \omega_l) \in S,\ \omega_l = \omega_j\}$.
• A generic sample set will simply be denoted by $D$.
• Each class-conditional $p(x \mid \omega_j)$ is assumed to have a known parametric form and is uniquely specified by a parameter (vector) $\theta_j$.
• Samples in each set $D_j$ are assumed to be independent and identically distributed (i.i.d.) according to the true probability law $p(x \mid \omega_j)$.

Log-Likelihood Function and Score Function

• The sample sets are assumed to be functionally independent, i.e., the training set $D_j$ contains no information about $\theta_i$ for $i \neq j$.
• The i.i.d. assumption implies that
$$p(D_j \mid \theta_j) = \prod_{x \in D_j} p(x \mid \theta_j).$$
• Let $D$ be a generic sample of size $n \equiv |D|$.
• Log-likelihood function:
$$l(\theta; D) \equiv \ln p(D \mid \theta) = \sum_{k=1}^{n} \ln p(x_k \mid \theta).$$
• The log-likelihood is identical to the logarithm of the probability density function, but it is interpreted as a function of the parameter $\theta$ for the given sample.

Log-Likelihood Illustration

• Assume that all the points in $D$ are drawn from some (one-dimensional) normal distribution with some (known) variance and unknown mean.

Log-Likelihood Function and Score Function cont.

• Maximum likelihood estimator (MLE):
$$\hat{\theta}(D) = \arg\max_{\theta \in \Theta}\, l(\theta; D)$$
(tacitly assuming that such a maximum exists!).
• Score function:
$$U_k(\theta; D) \equiv \frac{\partial l(\theta; D)}{\partial \theta_k}, \quad 1 \leq k \leq p, \qquad \text{hence} \qquad U(\theta; D) \equiv \nabla_{\theta}\, l(\theta; D).$$
• Necessary condition for the MLE (if it is not on the border of the domain $\Theta$): $U(\theta; D) = 0$.

Maximum A-Posteriori (MAP) Estimation

• Maximum a posteriori (MAP): find the value of $\theta$ that maximizes $l(\theta) + \ln p(\theta)$, where $p(\theta)$ is a prior probability over the parameter values. A MAP estimator finds the peak, or mode, of the posterior.
• Drawback of MAP: after an arbitrary nonlinear transformation of the parameter space, the density will change, and the MAP solution will no longer be correct.
• The "most likely value" is given by
$$\hat{\theta} = \arg\max_{\theta}\, p(\theta \mid X^{(n)}) = \arg\max_{\theta} \frac{p_0(\theta)\,p(X^{(n)} \mid \theta)}{p(X^{(n)})} = \arg\max_{\theta} \frac{p_0(\theta)\prod_{i=1}^{n} p(x_i \mid \theta)}{\int p(X^{(n)} \mid \theta')\,p_0(\theta')\,d\theta'},$$
where
$$p(X^{(n)} \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$
since the data are i.i.d.
• We can disregard the normalizing factor $p(X^{(n)})$ when looking for the maximum.
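Before continuing with MAP, here is a small sketch of the maximum-likelihood machinery above, using the exponential density from the examples; the simulated data and scipy optimizer are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Simulated data, assumed drawn from p(x | theta) = theta * exp(-theta * x).
rng = np.random.default_rng(1)
true_theta = 2.0
D = rng.exponential(scale=1.0 / true_theta, size=500)

def neg_log_likelihood(theta):
    # l(theta; D) = sum_k ln p(x_k | theta) = n ln(theta) - theta * sum_k x_k
    return -(len(D) * np.log(theta) - theta * D.sum())

# Numerical MLE over Theta = (0, infinity), truncated for the bounded optimizer.
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")

# Setting the score U(theta; D) = n/theta - sum_k x_k to zero gives theta_hat = 1/mean(D).
print(res.x, 1.0 / D.mean())  # the two estimates agree, both near true_theta = 2.0
```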
MAP continued

• So, the $\hat{\theta}$ we are looking for is
$$\hat{\theta} = \arg\max_{\theta}\left[p_0(\theta)\prod_{i=1}^{n} p(x_i \mid \theta)\right]$$
and, since the log is monotonically increasing,
$$\hat{\theta} = \arg\max_{\theta}\,\log\left[p_0(\theta)\prod_{i=1}^{n} p(x_i \mid \theta)\right] = \arg\max_{\theta}\left(\log p_0(\theta) + \log\prod_{i=1}^{n} p(x_i \mid \theta)\right) = \arg\max_{\theta}\left(\log p_0(\theta) + \sum_{i=1}^{n}\log p(x_i \mid \theta)\right).$$

The Gaussian Case: Unknown Mean

• Suppose that the samples are drawn from a multivariate normal population with mean $\mu$ and covariance matrix $\Sigma$.
• Consider first the case where only the mean is unknown, $\theta = \mu$. For a sample point $x_k$ we have
$$\ln p(x_k \mid \mu) = -\frac{1}{2}\ln\left[(2\pi)^d\,|\Sigma|\right] - \frac{1}{2}(x_k-\mu)^t\,\Sigma^{-1}(x_k-\mu)$$
and
$$\nabla_{\mu}\ln p(x_k \mid \mu) = \Sigma^{-1}(x_k-\mu).$$
• The maximum likelihood estimate for $\mu$ must satisfy
$$\sum_{k=1}^{n}\Sigma^{-1}(x_k-\hat{\mu}) = 0.$$
• Multiplying by $\Sigma$ and rearranging, we obtain
$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k.$$
• The MLE for the unknown population mean is just the arithmetic average of the training samples (the sample mean).
• Geometrically, if we think of the $n$ samples as a cloud of points, the sample mean is the centroid of the cloud.

The Gaussian Case: Unknown Mean and Covariance

• In the general multivariate normal case, neither the mean nor the covariance matrix is known: $\theta = [\mu, \Sigma]$.
• Consider first the univariate case with $\theta_1 = \mu$ and $\theta_2 = \sigma^2$. The log-likelihood of a single point is
$$\ln p(x_k \mid \theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{(x_k-\theta_1)^2}{2\theta_2}$$
and its derivative is
$$\nabla_{\theta}\,l = \nabla_{\theta}\ln p(x_k \mid \theta) = \begin{bmatrix} \dfrac{1}{\theta_2}(x_k-\theta_1) \\[2mm] -\dfrac{1}{2\theta_2} + \dfrac{(x_k-\theta_1)^2}{2\theta_2^2} \end{bmatrix}.$$
• Setting the gradient to zero and using all the sample points, we get the following necessary conditions:
$$\sum_{k=1}^{n}\frac{1}{\hat{\theta}_2}(x_k-\hat{\theta}_1) = 0 \qquad \text{and} \qquad -\sum_{k=1}^{n}\frac{1}{\hat{\theta}_2} + \sum_{k=1}^{n}\frac{(x_k-\hat{\theta}_1)^2}{\hat{\theta}_2^2} = 0,$$
where $\hat{\theta}_1 = \hat{\mu}$ and $\hat{\theta}_2 = \hat{\sigma}^2$ are the MLE estimates of $\theta_1$ and $\theta_2$ respectively.
• Solving for $\hat{\mu}$ and $\hat{\sigma}^2$, we obtain
$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k \qquad \text{and} \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})^2.$$

The Gaussian Multivariate Case

• For the multivariate case, it is easy to show that the MLE estimates are given by
$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k \qquad \text{and} \qquad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})(x_k-\hat{\mu})^t.$$
• The MLE for the mean vector is the sample mean, and the MLE for the covariance matrix is the arithmetic average of the $n$ matrices $(x_k-\hat{\mu})(x_k-\hat{\mu})^t$.
• The MLE for $\sigma^2$ is biased, i.e., the expected value over all data sets of size $n$ of the sample variance is not equal to the true variance:
$$E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2\right] = \frac{n-1}{n}\sigma^2 \neq \sigma^2.$$
• Unbiased estimators for $\mu$ and $\Sigma$ are given by
$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k \qquad \text{and} \qquad C = \frac{1}{n-1}\sum_{k=1}^{n}(x_k-\hat{\mu})(x_k-\hat{\mu})^t.$$
• $C$ is called the sample covariance matrix; $C$ is absolutely unbiased, while $\hat{\sigma}^2$ is asymptotically unbiased. (A quick numerical check of this bias follows below.)

Bayesian Estimation: Class-Conditional Densities

• The aim is to find the posteriors $P(\omega_i \mid x)$ knowing $p(x \mid \omega_i)$ and $P(\omega_i)$, but they are unknown. How to find them?
• Given the sample $D$, we say that the aim is to find $P(\omega_i \mid x, D)$. Bayes' formula gives
$$P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D)\,P(\omega_i \mid D)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\,P(\omega_j \mid D)}.$$
• We use the information provided by the training samples to determine the class-conditional densities and the prior probabilities.
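Before developing the Bayesian treatment, here is the promised empirical check of the bias of the MLE variance; the simulation parameters are assumptions for illustration.

```python
import numpy as np

# Empirical check that the MLE variance underestimates sigma^2 by the factor (n-1)/n.
rng = np.random.default_rng(2)
n, sigma2, trials = 10, 4.0, 100_000

mle_vars = np.empty(trials)
for t in range(trials):
    x = rng.normal(0.0, np.sqrt(sigma2), size=n)
    mle_vars[t] = ((x - x.mean()) ** 2).mean()  # sigma_hat^2 with the 1/n factor

print(mle_vars.mean())                # approx (n-1)/n * sigma2 = 3.6
print(mle_vars.mean() * n / (n - 1))  # approx sigma2 = 4.0 after the unbiased correction
```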
• Generally used assumptions:
  – The priors are generally known or obtainable from a trivial calculation, so $P(\omega_i) = P(\omega_i \mid D)$.
  – The training set can be separated into $c$ subsets $D_1, \ldots, D_c$.
  – The samples in $D_j$ have no influence on $p(x \mid \omega_i, D_i)$ if $i \neq j$.
• Thus we can write
$$P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, D_j)\,P(\omega_j)}.$$
• We have $c$ separate problems of the form: use a set $D$ of samples drawn independently according to a fixed but unknown probability distribution $p(x)$ to determine $p(x \mid D)$.

Bayesian Estimation: General Theory

• Bayesian learning considers $\theta$ (the parameter vector to be estimated) to be a random variable. Before we observe the data, the parameters are described by a prior $p(\theta)$, which is typically very broad. Once we have observed the data, we can make use of Bayes' formula to find the posterior $p(\theta \mid D)$. Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning.
• Density function for $x$, given the training data set $D$:
$$p(x \mid D) = \int p(x, \theta \mid D)\,d\theta.$$
• From the definition of conditional probability densities,
$$p(x, \theta \mid D) = p(x \mid \theta, D)\,p(\theta \mid D).$$
• The first factor is independent of $D$, since it is just our assumed form for the parameterized density: $p(x \mid \theta, D) = p(x \mid \theta)$.
• Therefore
$$p(x \mid D) = \int p(x \mid \theta)\,p(\theta \mid D)\,d\theta.$$
• Instead of choosing a specific value for $\theta$, the Bayesian approach performs a weighted average over all values of $\theta$. The weighting factor $p(\theta \mid D)$, the posterior of $\theta$, is determined by starting from some assumed prior $p(\theta)$ and updating it using Bayes' formula to take account of the data set $D$. Since $D = \{x_1, \ldots, x_N\}$ are drawn independently,
$$p(D \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta), \tag{$*$}$$
which is the likelihood function.
• The posterior for $\theta$ is
$$p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{p(D)} = \frac{p(\theta)}{p(D)}\prod_{n=1}^{N} p(x_n \mid \theta), \tag{$**$}$$
with normalization factor
$$p(D) = \int p(\theta')\prod_{n=1}^{N} p(x_n \mid \theta')\,d\theta'.$$

Bayesian Learning – Univariate Normal Distribution

• Let us use the Bayesian estimation technique to calculate the posterior density $p(\theta \mid D)$ and the desired probability density $p(x \mid D)$ for the case $p(x \mid \mu) \sim N(\mu, \Sigma)$.
• Univariate case, $p(\mu \mid D)$: let $\mu$ be the only unknown parameter, $p(x \mid \mu) \sim N(\mu, \sigma^2)$.
• Prior probability: a normal distribution over $\mu$, $p(\mu) \sim N(\mu_0, \sigma_0^2)$. Here $\mu_0$ encodes some prior knowledge about the true mean $\mu$, while $\sigma_0^2$ measures our prior uncertainty.
• If $\mu$ is drawn from $p(\mu)$, then the density for $x$ is completely determined. Letting $D = \{x_1, \ldots, x_n\}$, we use
$$p(\mu \mid D) = \frac{p(D \mid \mu)\,p(\mu)}{\int p(D \mid \mu)\,p(\mu)\,d\mu} = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\,p(\mu).$$
• Computing the posterior distribution:
$$p(\mu \mid D) \propto p(D \mid \mu)\,p(\mu) = \alpha' \exp\left[-\frac{1}{2}\left(\sum_{k=1}^{n}\left(\frac{x_k-\mu}{\sigma}\right)^2 + \left(\frac{\mu-\mu_0}{\sigma_0}\right)^2\right)\right] = \alpha'' \exp\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right],$$
where factors that do not depend on $\mu$ have been absorbed into the constants $\alpha'$ and $\alpha''$.
• $p(\mu \mid D)$ is an exponential function of a quadratic function of $\mu$, i.e., it is a normal density.
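This posterior can also be checked numerically before identifying the coefficients. Here is a small grid-based sketch of $p(\mu \mid D) \propto p(D \mid \mu)\,p(\mu)$; the values chosen for $\sigma$, $\mu_0$, $\sigma_0$ and the data are illustrative assumptions, not from the lecture.

```python
import numpy as np

# Grid-based sketch of p(mu | D) ∝ p(D | mu) p(mu) for a N(mu, sigma^2) likelihood
# and a N(mu0, sigma0^2) prior; all numbers are illustrative assumptions.
sigma, mu0, sigma0 = 1.0, 0.0, 2.0
rng = np.random.default_rng(3)
D = rng.normal(1.5, sigma, size=20)  # data whose true mean is 1.5

mu = np.linspace(-4.0, 4.0, 2001)    # grid over the parameter space
log_prior = -0.5 * ((mu - mu0) / sigma0) ** 2
log_lik = -0.5 * (((D[:, None] - mu[None, :]) / sigma) ** 2).sum(axis=0)

log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())  # subtract the max for numerical stability
post /= post.sum() * (mu[1] - mu[0])      # normalize on the grid

print(mu[post.argmax()])  # posterior mode, pulled slightly from the sample mean toward mu0
```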
• $p(\mu \mid D)$ remains normal for any number of training samples. If we write
$$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^2\right],$$
then, identifying the coefficients, we get
$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \qquad \text{and} \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2},$$
where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the sample mean.
• Solving explicitly for $\mu_n$ and $\sigma_n^2$, we obtain
$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2+\sigma^2}\,\mu_0 \qquad \text{and} \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2+\sigma^2}.$$
• $\mu_n$ represents our best guess for $\mu$ after observing $n$ samples, and $\sigma_n^2$ measures our uncertainty about this guess.
• $\sigma_n^2$ decreases monotonically with $n$ (approaching $\sigma^2/n$ as $n$ approaches infinity).
• Each additional observation decreases our uncertainty about the true value of $\mu$. As $n$ increases, $p(\mu \mid D)$ becomes more and more sharply peaked, approaching a Dirac delta function as $n$ approaches infinity. This behavior is known as Bayesian learning.
• In general, $\mu_n$ is a linear combination of $\hat{\mu}_n$ and $\mu_0$, with coefficients that are non-negative and sum to 1. Thus $\mu_n$ lies somewhere between $\hat{\mu}_n$ and $\mu_0$.
• If $\sigma_0 \neq 0$, then $\mu_n \to \hat{\mu}_n$ as $n \to \infty$.
• If $\sigma_0 = 0$, our a priori certainty that $\mu = \mu_0$ is so strong that no number of observations can change our opinion.
• If $\sigma_0 \gg \sigma$, the a priori guess is very uncertain, and we take $\mu_n = \hat{\mu}_n$.
• The ratio $\sigma^2/\sigma_0^2$ is called the dogmatism.

Bayesian Learning – Univariate Normal Distribution: $p(x \mid D)$

• The univariate case:
$$p(x \mid D) = \int p(x \mid \mu)\,p(\mu \mid D)\,d\mu = \int \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]\frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^2\right]d\mu = \frac{1}{2\pi\sigma\sigma_n}\exp\left[-\frac{1}{2}\,\frac{(x-\mu_n)^2}{\sigma^2+\sigma_n^2}\right]f(\sigma, \sigma_n),$$
where
$$f(\sigma, \sigma_n) = \int \exp\left[-\frac{1}{2}\,\frac{\sigma^2+\sigma_n^2}{\sigma^2\sigma_n^2}\left(\mu - \frac{\sigma_n^2 x + \sigma^2\mu_n}{\sigma^2+\sigma_n^2}\right)^2\right]d\mu.$$
• Since $p(x \mid D) \propto \exp\left[-\frac{1}{2}\,\frac{(x-\mu_n)^2}{\sigma^2+\sigma_n^2}\right]$, we can write $p(x \mid D) \sim N(\mu_n, \sigma^2+\sigma_n^2)$.
• To obtain the class-conditional probability $p(x \mid D)$, whose parametric form is known to be $p(x \mid \mu) \sim N(\mu, \sigma^2)$, we replace $\mu$ by $\mu_n$ and $\sigma^2$ by $\sigma^2+\sigma_n^2$.
• The conditional mean $\mu_n$ is treated as if it were the true mean, and the known variance is increased to account for the additional uncertainty in $x$ resulting from our lack of exact knowledge of the mean $\mu$.

Example (demo-MAP)

• We have $N$ points generated by a one-dimensional Gaussian, $p(x \mid \mu) = G_x[\mu, 1]$. Since we think that the mean should not be very big, we use as a prior $p(\mu) = G_\mu[0, \alpha^2]$, where $\alpha$ is a hyperparameter. The total objective function is
$$E \propto -\sum_{n=1}^{N}(x_n-\mu)^2 - \frac{\mu^2}{\alpha^2},$$
which is maximized to give
$$\mu = \frac{1}{N+\frac{1}{\alpha^2}}\sum_{n=1}^{N} x_n.$$
• For $N \gg \frac{1}{\alpha^2}$ the influence of the prior is negligible and the result is the ML estimate. But for a very strong belief in the prior, $\frac{1}{\alpha^2} \gg N$, the estimate tends to zero. Thus, if few data are available, the prior will bias the estimate towards the prior expected value.

Recursive Bayesian Incremental Learning

• We have seen that $p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$. Let us define $D^n = \{x_1, \ldots, x_n\}$. Then
$$p(D^n \mid \theta) = p(x_n \mid \theta)\,p(D^{n-1} \mid \theta).$$
(A small numerical sketch of this incremental updating, using the closed-form Gaussian results above, follows below.)
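Here is the promised sketch: the closed-form $\mu_n$, $\sigma_n^2$ updates applied one observation at a time, with the posterior after each sample serving as the prior for the next. The numerical values are illustrative assumptions.

```python
import numpy as np

# Sequential sketch of the closed-form updates above: after each new sample the
# posterior N(mu_n, var_n) becomes the prior for the next point.
sigma, mu_n, var_n = 1.0, 0.0, 4.0  # likelihood std, prior mean mu_0, prior variance sigma_0^2
rng = np.random.default_rng(4)

for x in rng.normal(1.5, sigma, size=10):
    # Conjugate update with n = 1 and the current prior (mu_n, var_n):
    mu_n = (var_n * x + sigma**2 * mu_n) / (var_n + sigma**2)
    var_n = (var_n * sigma**2) / (var_n + sigma**2)
    print(f"mu_n = {mu_n:.3f}, var_n = {var_n:.3f}")  # var_n shrinks monotonically

# The predictive density after all updates is N(mu_n, sigma^2 + var_n).
print(mu_n, sigma**2 + var_n)
```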
• Substituting this into $p(\theta \mid D^n)$ and using Bayes' formula, we have
$$p(\theta \mid D^n) = \frac{p(D^n \mid \theta)\,p(\theta)}{\int p(D^n \mid \theta)\,p(\theta)\,d\theta} = \frac{p(x_n \mid \theta)\,p(D^{n-1} \mid \theta)\,p(\theta)}{\int p(x_n \mid \theta)\,p(D^{n-1} \mid \theta)\,p(\theta)\,d\theta}.$$
Writing $p(D^{n-1} \mid \theta)\,p(\theta) = p(\theta \mid D^{n-1})\,p(D^{n-1})$ and cancelling the common factor $p(D^{n-1})$, we finally obtain
$$p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\,p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\,p(\theta \mid D^{n-1})\,d\theta}.$$
• With $p(\theta \mid D^0) = p(\theta)$, repeated use of this equation produces the sequence $p(\theta),\ p(\theta \mid x_1),\ p(\theta \mid x_1, x_2), \ldots$
• This is called the recursive Bayes approach to parameter estimation (also incremental or on-line learning).
• When this sequence of densities converges to a Dirac delta function centered about the true parameter value, we have Bayesian learning.

Maximum Likelihood vs. Bayesian

• ML and Bayesian estimation are asymptotically equivalent and "consistent": they yield the same class-conditional densities when the size of the training data grows to infinity.
• ML is typically computationally easier: in ML we need to do (multidimensional) differentiation, and in Bayesian estimation (multidimensional) integration.
• ML is often easier to interpret: it returns the single best model (parameter), whereas Bayesian estimation gives a weighted average of models.
• But for finite training data (and given a reliable prior), Bayesian estimation is more accurate (it uses more of the information).
• Bayesian estimation with a "flat" prior is essentially ML; with asymmetric and broad priors the methods lead to different solutions.

Problems of Dimensionality: Accuracy, Dimension, and Training Sample Size

• Consider two-class multivariate normal distributions $p(x \mid \omega_i) \sim N(\mu_i, \Sigma)$ with the same covariance. If the priors are equal, then the Bayesian error rate is given by
$$P(e) = \frac{1}{\sqrt{2\pi}}\int_{r/2}^{\infty} e^{-u^2/2}\,du,$$
where $r^2$ is the squared Mahalanobis distance:
$$r^2 = (\mu_1-\mu_2)^t\,\Sigma^{-1}(\mu_1-\mu_2).$$
• Thus the probability of error decreases as $r$ increases. In the conditionally independent case, $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$ and
$$r^2 = \sum_{i=1}^{d}\left(\frac{\mu_{i1}-\mu_{i2}}{\sigma_i}\right)^2.$$
• While classification accuracy can become better with growing dimensionality (and a growing amount of training data),
  – beyond a certain point, the inclusion of additional features leads to worse rather than better performance,
  – the computational complexity grows,
  – and the problem of overfitting arises.

Occam's Razor

• "Pluralitas non est ponenda sine necessitate", or "plurality should not be posited without necessity." The words are those of the medieval English philosopher and Franciscan monk William of Occam (ca. 1285–1349).
• Decisions based on overly complex models often lead to lower accuracy of the classifier.

What is Feature Reduction?

• Feature reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space.
• The criterion for feature reduction can differ based on the problem setting:
  – Unsupervised setting: minimize the information loss.
  – Supervised setting: maximize the class discrimination.
• Given a set of data points of $p$ variables $\{x_1, x_2, \ldots, x_n\}$, compute the linear transformation (projection) $G \in \mathbb{R}^{p \times d}$:
$$x \in \mathbb{R}^p \ \rightarrow\ y = G^T x \in \mathbb{R}^d \qquad (d \ll p).$$
• Schematically: the original data $X \in \mathbb{R}^p$ are mapped by the linear transformation $G^T \in \mathbb{R}^{d \times p}$ to the reduced data $Y = G^T X \in \mathbb{R}^d$.
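As a concrete instance of the unsupervised criterion, here is a minimal PCA sketch in which $G$ is taken to be the top-$d$ eigenvectors of the sample covariance matrix; the data and dimensions are illustrative assumptions.

```python
import numpy as np

# Minimal PCA sketch: choose G as the top-d eigenvectors of the sample covariance
# matrix, a standard choice for the unsupervised "minimize information loss" criterion.
rng = np.random.default_rng(5)
cov = [[3.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 0.1]]
X = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=500)  # n x p, here p = 3

Xc = X - X.mean(axis=0)               # center the data
C = np.cov(Xc, rowvar=False)          # sample covariance matrix, p x p
eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns eigenvalues in ascending order

d = 2
G = eigvecs[:, ::-1][:, :d]           # top-d principal directions, G in R^{p x d}
Y = Xc @ G                            # reduced data y = G^T x applied row-wise, n x d
print(Y.shape)                        # (500, 2)
```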