Classification Problem

• Given training data $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, with $\mathbf{x}_n \in \mathbb{R}^q$ and $y_n \in \{-, +\}$.
[Figure: positive and negative training points in the $(x_1, x_2)$ plane, with a query point $\mathbf{x}_0$.]
• Predict the class label of a given query $\mathbf{x}_0$.
• The pairs are drawn from an unknown probability distribution $P(\mathbf{x}, y)$.
• We need to estimate $P(+ \mid \mathbf{x}_0) = f_+(\mathbf{x}_0)$ and $P(- \mid \mathbf{x}_0) = f_-(\mathbf{x}_0)$.

The Bayesian Classifier

• Loss function $\lambda(j \mid k)$: the loss incurred by predicting class $j$ when the true class is $k$.
• Expected loss (conditional risk) associated with class $j$: $R(j \mid \mathbf{x}) = \sum_{k=1}^{J} \lambda(j \mid k)\, P(k \mid \mathbf{x})$.
• Bayes rule: $j^* = \arg\min_{1 \le j \le J} R(j \mid \mathbf{x})$.
• Zero-one loss function: $\lambda(j \mid k) = 0$ if $j = k$, and $\lambda(j \mid k) = 1$ if $j \ne k$. Under this loss $R(j \mid \mathbf{x}) = 1 - P(j \mid \mathbf{x})$, so the Bayes rule becomes $j^* = \arg\max_{1 \le j \le J} P(j \mid \mathbf{x})$.
• The Bayes rule achieves the minimum error rate.
• How do we estimate the posterior probabilities $\{P(j \mid \mathbf{x})\}_{j=1}^{J}$? The plug-in classifier is $\hat{j}(\mathbf{x}) = \arg\max_{1 \le j \le J} \hat{P}(j \mid \mathbf{x})$.

Density Estimation

• Use Bayes' theorem to estimate the posterior probability values: $P(j \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid j)\, P(j)}{\sum_{k=1}^{J} p(\mathbf{x} \mid k)\, P(k)}$.
• $p(\mathbf{x} \mid j)$ is the probability density function of $\mathbf{x}$ given class $j$; $P(j)$ is the prior probability of class $j$.

Naïve Bayes Classifier

• Makes the assumption that the features are independent given the class: $p(\mathbf{x} \mid j) = p(x_1, x_2, \ldots, x_q \mid j) = \prod_{i=1}^{q} p(x_i \mid j)$.
• The task of estimating one $q$-dimensional density function is reduced to estimating $q$ one-dimensional density functions, so the complexity of the task is drastically reduced.
• The application of Bayes' theorem becomes much simpler.
• Proven to be effective in practice (see the first code sketch at the end of this section).

Nearest-Neighbor Methods

• Predict the class label of $\mathbf{x}_0$ as the most frequent one occurring among its $K$ nearest neighbors (see the second code sketch at the end of this section).
[Figure: positive and negative training points in the $(x_1, x_2)$ plane, with nearest neighborhoods of increasing size drawn around the query point $\mathbf{x}_0$.]
• Basic assumption: $f_+(\mathbf{x} + \Delta\mathbf{x}) \approx f_+(\mathbf{x})$ and $f_-(\mathbf{x} + \Delta\mathbf{x}) \approx f_-(\mathbf{x})$ for small $\Delta\mathbf{x}$.

Example: Letter Recognition

• Features of each letter image include the first statistical moment, ..., the edge count.

Asymptotic Properties of K-NN Methods

• $\lim_{N \to \infty} \hat{f}_j(\mathbf{x}) = f_j(\mathbf{x})$ if $\lim_{N \to \infty} K = \infty$ and $\lim_{N \to \infty} K/N = 0$.
• The first condition reduces the variance by making the estimate independent of the accidental characteristics of the $K$ nearest neighbors.
• The second condition reduces the bias by ensuring that the $K$ nearest neighbors are arbitrarily close to the query point.
• $\lim_{N \to \infty} E_1 \le 2 E^*$, where $E_1$ is the classification error rate of the 1-NN rule and $E^*$ is the classification error rate of the Bayes rule.
• In the asymptotic limit, no decision rule is more than twice as accurate as the 1-NN rule.

Finite-Sample Settings

• How well does the 1-NN rule work in finite-sample settings?
• If the number of training points $N$ is large and the number of input features $q$ is small, the asymptotic results may still be valid.
• However, for a moderate to large number of input variables, the sample size required for their validity is beyond feasibility.
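To make the Naïve Bayes slide concrete, here is a minimal sketch (not from the slides) of a Gaussian Naïve Bayes classifier: each class-conditional density $p(x_i \mid j)$ is modeled as a univariate Gaussian, an assumption I make for illustration, and prediction applies the arg-max-posterior Bayes rule. Class and attribute names are illustrative.

```python
import numpy as np

class GaussianNaiveBayes:
    """Per class j, model p(x | j) = prod_i p(x_i | j) with one univariate Gaussian per feature."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])            # P(j)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])      # per-feature means
        self.vars_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes_]) # per-feature variances
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        log_post = []
        for prior, mu, var in zip(self.priors_, self.means_, self.vars_):
            # log p(x | j) under the independence assumption, plus log P(j)
            log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
            log_post.append(np.log(prior) + log_lik)
        # arg max_j P(j | x): the denominator of Bayes' theorem is the same for every class
        return self.classes_[np.argmax(np.column_stack(log_post), axis=1)]
```

Fitting on $\{(\mathbf{x}_n, y_n)\}$ and calling predict on the queries implements $\hat{j}(\mathbf{x}) = \arg\max_j \hat{P}(j \mid \mathbf{x})$ with $q$ one-dimensional density estimates per class instead of one $q$-dimensional estimate.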
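A $K$-nearest-neighbor prediction for a single query can likewise be sketched in a few lines. The function name and the choice of Euclidean distance are assumptions for illustration; Euclidean distance is the one used in the curse-of-dimensionality discussion that follows.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, k=5):
    """Predict the label of query x0 as the most frequent label among its k nearest neighbors."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    dists = np.linalg.norm(X_train - x0, axis=1)           # Euclidean distances to all training points
    nearest = np.argsort(dists)[:k]                        # indices of the k closest training points
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote over their labels
```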
Curse of Dimensionality

• This phenomenon is known as the curse of dimensionality.
• It refers to the fact that in high-dimensional spaces data become extremely sparse and far apart from each other.
• It affects any estimation problem with high dimensionality (see the simulation sketch at the end of this section).
[Figure: distribution of the ratio DMAX/DMIN as a function of the dimensionality, for a sample of size N = 500 uniformly distributed in $[0, 1]^q$.]
• The distribution of the ratio DMAX/DMIN converges to 1 as the dimensionality increases.
[Figure: variance of the distances from a given point as a function of the dimensionality.]
• The variance of the distances from a given point converges to 0 as the dimensionality increases.
[Figure: distance values from a given point; the values flatten out as the dimensionality increases.]

Computing Radii of Nearest Neighborhoods

[Figure: median radius of a nearest neighborhood as a function of the dimensionality, for a uniform distribution in the unit cube $[-0.5, 0.5]^q$.]

Curse of Dimensionality

• As the dimensionality increases, the distance to the closest point increases faster.
• Random sample of size $N$ drawn from the uniform distribution in the $q$-dimensional unit hypercube.
• Diameter of a $K = 1$ neighborhood using Euclidean distance: $d(q, N) = O(N^{-1/q})$ (the table below is reproduced by the sketch at the end of this section).

q    N        d(q,N)
4    100      0.42
4    1,000    0.23
6    100      0.71
6    1,000    0.48
10   1,000    0.91
10   10,000   0.72
20   10,000   1.51
20   10^6     1.20
20   10^10    0.76

• A large $d(q, N)$ leads to highly biased estimations.

Curse of Dimensionality

• It is a serious problem in many real-world applications:
• Microarray data: 3,000-4,000 genes;
• Documents: 10,000-20,000 words in the dictionary;
• Images, face recognition, etc.
• How can we deal with the curse of dimensionality?
• Example sample covariance matrix: $\begin{pmatrix} 7.68 & 92.2 \\ 92.2 & 1912.5 \end{pmatrix}$.
• Mean vector: $\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}^i$, with $\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$.
• Covariance matrix: $E\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\right] = E\begin{pmatrix} (x_1-\mu_1)^2 & (x_1-\mu_1)(x_2-\mu_2) \\ (x_1-\mu_1)(x_2-\mu_2) & (x_2-\mu_2)^2 \end{pmatrix} = \begin{pmatrix} \frac{1}{N}\sum_{i=1}^{N}(x_1^i-\mu_1)^2 & \frac{1}{N}\sum_{i=1}^{N}(x_1^i-\mu_1)(x_2^i-\mu_2) \\ \frac{1}{N}\sum_{i=1}^{N}(x_1^i-\mu_1)(x_2^i-\mu_2) & \frac{1}{N}\sum_{i=1}^{N}(x_2^i-\mu_2)^2 \end{pmatrix}$.
• The diagonal entries are variances; the off-diagonal entries are covariances.
• Further example sample covariance matrices, with different degrees of correlation between $x_1$ and $x_2$: $\begin{pmatrix} 0.94 & 0.93 \\ 0.93 & 1.03 \end{pmatrix}$, $\begin{pmatrix} 0.97 & 0.49 \\ 0.49 & 1.04 \end{pmatrix}$, $\begin{pmatrix} 0.93 & 0.01 \\ 0.01 & 1.05 \end{pmatrix}$, $\begin{pmatrix} 0.99 & 0.5 \\ 0.5 & 1.06 \end{pmatrix}$, $\begin{pmatrix} 1.04 & 1.05 \\ 1.05 & 1.15 \end{pmatrix}$.

Dimensionality Reduction

• Many dimensions are often interdependent (correlated). We can:
• reduce the dimensionality of the problem;
• transform interdependent coordinates into significant and independent ones (see the PCA sketch at the end of this section).
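The qualitative behavior shown in the distance plots above can be checked with a short simulation (a sketch, not from the slides): for points uniform in $[0, 1]^q$, the max-to-min distance ratio from a query point approaches 1 and the relative spread of the distances (variance normalized by the squared mean) shrinks as $q$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500  # sample size used in the slides' plots

for q in (2, 10, 50, 200, 1000):
    X = rng.uniform(0.0, 1.0, size=(N, q))    # N points uniform in [0, 1]^q
    x0 = rng.uniform(0.0, 1.0, size=q)        # a given query point
    d = np.linalg.norm(X - x0, axis=1)        # distances from the query to every point
    print(f"q={q:5d}  Dmax/Dmin={d.max() / d.min():6.2f}  "
          f"relative variance={d.var() / d.mean() ** 2:.5f}")
```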
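The slide itself only states the order $d(q, N) = O(N^{-1/q})$; the tabulated values coincide with the diameter of a $q$-dimensional ball of volume $1/N$ (so that it contains, on average, one of the $N$ uniform points). Treating that as the working definition (an assumption on my part), the table above can be reproduced:

```python
import math

def ball_diameter(q, N):
    """Diameter of the q-dimensional ball of volume 1/N (assumed definition of d(q, N))."""
    # volume of a q-ball of radius r: pi^(q/2) * r^q / Gamma(q/2 + 1)
    r = (math.gamma(q / 2 + 1) / (N * math.pi ** (q / 2))) ** (1.0 / q)
    return 2.0 * r

for q, N in [(4, 100), (4, 1000), (6, 100), (6, 1000), (10, 1000),
             (10, 10000), (20, 10**4), (20, 10**6), (20, 10**10)]:
    print(f"q={q:2d}  N={N:>15,}  d(q,N)={ball_diameter(q, N):.2f}")
```

Running this prints the nine d(q, N) values of the table, in the same row order.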
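The slides stop at the motivation for dimensionality reduction. One standard technique that matches the description "transform interdependent coordinates into significant and independent ones" is principal component analysis (PCA); the following is a minimal sketch under that assumption, built directly on the covariance matrix defined above, with an illustrative function name.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the data onto the leading eigenvectors of its covariance matrix."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)                        # mean vector
    Xc = X - mu
    cov = (Xc.T @ Xc) / X.shape[0]             # covariance matrix E[(x - mu)(x - mu)^T]
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]          # directions sorted by decreasing variance
    W = eigvecs[:, order[:n_components]]       # keep the top n_components directions
    return Xc @ W                              # new coordinates: uncorrelated, largest variance first
```

For example, pca_reduce(X, 2) projects $q$-dimensional data onto the two directions of largest variance, and the projected coordinates are uncorrelated with each other.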