Kernel Methods
A B M Shawkat Ali

Data Mining
¤ DM, or KDD (Knowledge Discovery in Databases): extracting previously unknown, valid, and actionable information for crucial decisions.
¤ Approach: Train Data → Model → Test Data → crucial decisions.

History of SVM
• The original optimal hyperplane algorithm proposed by Vladimir Vapnik in 1963 was a linear classifier.
• However, in 1992, Bernhard Boser, Isabelle Guyon and Vapnik suggested a way to create non-linear classifiers by applying the kernel trick (originally proposed by Aizerman et al.) to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a non-linear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be non-linear and the transformed space high-dimensional; thus, although the classifier is a hyperplane in the high-dimensional feature space, it may be non-linear in the original input space.

Property of the SVM
¤ Relatively new approach.
¤ A lot of recent interest, with many successes, e.g. text classification.
¤ Important concepts: transformation into a high-dimensional space; finding a "maximal margin" separation; structural risk minimization rather than empirical risk minimization.

Support Vector Machine (SVM)
¤ Classification: grouping of similar data.
¤ Regression: prediction from historical knowledge.
¤ Novelty detection: detecting abnormal instances in a dataset.
¤ Clustering, feature selection.

SVM Block Diagram (training)
[Block diagram: Training Data Domain → non-linear mapping by kernel → linear feature space of SVM → choose the optimal hyperplane.]

SVM Block Diagram (testing)
[Block diagram: Test Data Domain → kernel mapping → constructed model through feature knowledge → Class I / Class II.]

SVM Formulation (hard margin)
Decision function: $y = \operatorname{sign}(\mathbf{w}\cdot\mathbf{X} + b)$.
The margin is $1/\|\mathbf{w}\|$, with $y_i(\mathbf{w}\cdot\mathbf{X}_i + b) \ge 1$.
Optimisation problem: $\min_{\mathbf{w}}\ \mathbf{w}\cdot\mathbf{w}$ subject to $y_i(\mathbf{w}\cdot\mathbf{X}_i + b) \ge 1,\ \forall i \in S$.
Solution: $\mathbf{w} = \sum_{i\in S}\alpha_i y_i \mathbf{X}_i$, so $y = \operatorname{sign}\big(\sum_{i\in S}\alpha_i y_i\,\mathbf{X}_i\cdot\mathbf{X} + b\big)$.

SVM Formulation (soft margin)
Decision function: $y = \operatorname{sign}(\mathbf{w}\cdot\mathbf{X} + b)$.
Optimisation problem: $\min_{\mathbf{w},b}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \big[1 - y_i(\mathbf{w}\cdot\mathbf{X}_i + b)\big]_+$, which relaxes the hard constraints $y_i(\mathbf{w}\cdot\mathbf{X}_i + b) \ge 1,\ \forall i \in S$.
Solution: $\mathbf{w} = \sum_{i\in S}\alpha_i y_i \mathbf{X}_i$, so $y = \operatorname{sign}\big(\sum_{i\in S}\alpha_i y_i\,\mathbf{X}_i\cdot\mathbf{X} + b\big)$.

SVM Formulation (kernel trick)
Map the inputs with $\Phi(\cdot)$: $\mathbf{X}_i = \phi(\mathbf{x}_i)$, $\mathbf{X} = \phi(\mathbf{x})$, so $\mathbf{X}_i\cdot\mathbf{X} = \phi(\mathbf{x}_i)\cdot\phi(\mathbf{x})$ and
$y = \operatorname{sign}\big(\sum_{i\in S}\alpha_i y_i\,\phi(\mathbf{x}_i)\cdot\phi(\mathbf{x}) + b\big)$.
For a kernel $K(\cdot,\cdot)$ satisfying Mercer's condition, $\phi(\mathbf{x}_i)\cdot\phi(\mathbf{x}) = K(\mathbf{x}_i,\mathbf{x})$, hence
$y = \operatorname{sign}\big(\sum_{i\in S}\alpha_i y_i\,K(\mathbf{x}_i,\mathbf{x}) + b\big)$.

Types of Kernels
Common kernels for SVM:
¤ Linear
¤ Polynomial
¤ Radial Basis Function
New kernels (not used in SVM):
¤ Laplace
¤ Multiquadratic

SVM Kernels
Linear: $K(\mathbf{x}_i,\mathbf{x}) = \mathbf{x}_i\cdot\mathbf{x}$
Polynomial: $K(\mathbf{x}_i,\mathbf{x}) = (\kappa\,\mathbf{x}_i\cdot\mathbf{x})^d$
Gaussian (Radial Basis Function): $K(\mathbf{x}_i,\mathbf{x}) = \exp\!\Big(-\frac{(\mathbf{x}_i\cdot\mathbf{x}_i) - 2(\mathbf{x}_i\cdot\mathbf{x}) + (\mathbf{x}\cdot\mathbf{x})}{2\sigma^2}\Big) = \exp\!\Big(-\frac{\|\mathbf{x}_i-\mathbf{x}\|^2}{2\sigma^2}\Big)$

Laplace Kernel
Introduced by Pavel Paclik et al. in Pattern Recognition Letters 21 (2000). The Laplace kernel is based on the Laplace probability density:
$\hat f(\mathbf{x}\mid\omega) = \frac{1}{N}\sum_{i=1}^{N}\prod_{j=1}^{D}\frac{1}{2h_c}\exp\!\Big(-\frac{|x_j - x_{ij}|}{h_c}\Big)$,
where $h_c$ is the smoothing parameter (Sp).

Linear Kernel
[Figure: data separation with a linear kernel.]

The Reality of Data Separation
[Figure: non-linearly separable data.]

RBF Kernel
[Figure: data separation with an RBF kernel.]

XOR Solved by SVM

Table 5.3. Boolean XOR problem

  Input data x    Output class y
  (−1, −1)        −1
  (−1, +1)        +1
  (+1, −1)        +1
  (+1, +1)        −1

• First, we transform the dataset with the polynomial kernel $K(\mathbf{x}_i,\mathbf{x}_j) = (1 + \mathbf{x}_i^{\top}\mathbf{x}_j)^2$.
Here the data matrix and its transpose are
$\begin{bmatrix}-1 & -1\\ -1 & +1\\ +1 & -1\\ +1 & +1\end{bmatrix}$ and $\begin{bmatrix}-1 & -1 & +1 & +1\\ -1 & +1 & -1 & +1\end{bmatrix}$,
so the kernel matrix is
$K(\mathbf{x}_i,\mathbf{x}_j) = \begin{bmatrix} 9 & 1 & 1 & 1\\ 1 & 9 & 1 & 1\\ 1 & 1 & 9 & 1\\ 1 & 1 & 1 & 9 \end{bmatrix}$.
Following the SVM implementation given in Figure 5.20, the maximisation term is
$Q(\alpha) = \sum_{i=1}^{4}\alpha_i - \frac{1}{2}\sum_{i=1}^{4}\sum_{j=1}^{4}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i,\mathbf{x}_j)$
$= (\alpha_1+\alpha_2+\alpha_3+\alpha_4) - \frac{1}{2}\big(9\alpha_1^2 - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4 + 9\alpha_2^2 + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 + 9\alpha_3^2 - 2\alpha_3\alpha_4 + 9\alpha_4^2\big)$,
subject to:
$\sum_{i=1}^{4} y_i\alpha_i = -\alpha_1 + \alpha_2 + \alpha_3 - \alpha_4 = 0$,
$\alpha_1 \ge 0,\ \alpha_2 \ge 0,\ \alpha_3 \ge 0,\ \alpha_4 \ge 0$.
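The kernel definitions and the XOR kernel matrix above are easy to check numerically. Below is a minimal sketch, assuming NumPy is available; the helper names linear_kernel, polynomial_kernel, rbf_kernel and the parameters kappa, c, degree and sigma are illustrative choices, not from the slides. The offset c is included so the same polynomial form covers the (1 + x_i·x_j)^2 kernel used in this example.

```python
import numpy as np

# XOR training data and labels (Table 5.3)
X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
y = np.array([-1.0, +1.0, +1.0, -1.0])

def linear_kernel(A, B):
    # K(xi, x) = xi . x
    return A @ B.T

def polynomial_kernel(A, B, kappa=1.0, c=1.0, degree=2):
    # K(xi, x) = (kappa * xi . x + c)^degree
    return (kappa * (A @ B.T) + c) ** degree

def rbf_kernel(A, B, sigma=1.0):
    # K(xi, x) = exp(-||xi - x||^2 / (2 sigma^2)), via the expanded squared distance
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                - 2.0 * (A @ B.T)
                + np.sum(B**2, axis=1)[None, :])
    return np.exp(-sq_dists / (2.0 * sigma**2))

# Kernel matrix of the XOR dual problem: K_ij = (1 + xi . xj)^2
K = polynomial_kernel(X, X)
print(K)   # diagonal entries 9, off-diagonal entries 1, as in the text

# Dual objective Q(alpha) = sum(alpha) - 1/2 * alpha^T (y y^T * K) alpha
def dual_objective(alpha):
    H = np.outer(y, y) * K
    return alpha.sum() - 0.5 * alpha @ H @ alpha
```

Maximising this dual_objective subject to the constraints above is exactly what the stationarity system in the next step solves in closed form.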
Setting the partial derivatives $\partial Q/\partial\alpha_i = 0$ gives the linear system
$9\alpha_1 - \alpha_2 - \alpha_3 + \alpha_4 = 1$
$-\alpha_1 + 9\alpha_2 + \alpha_3 - \alpha_4 = 1$
$-\alpha_1 + \alpha_2 + 9\alpha_3 - \alpha_4 = 1$
$\alpha_1 - \alpha_2 - \alpha_3 + 9\alpha_4 = 1$
By solving these equations we can write the solution to this optimisation problem as
$\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = \tfrac{1}{8}$.
Therefore, the decision function in the inner-product representation is
$\hat f(\mathbf{x}) = \sum_{i=1}^{4}\alpha_i y_i K(\mathbf{x}_i,\mathbf{x}) = 0.125\sum_{i=1}^{4} y_i\,(\mathbf{x}_i^{\top}\mathbf{x} + 1)^2$.
The 2nd-degree polynomial kernel function expands as
$K(\mathbf{x}_i,\mathbf{x}_j) = (\langle\mathbf{x}_i,\mathbf{x}_j\rangle + 1)^2 = (x_{i1}x_{j1} + x_{i2}x_{j2})^2 + 2(x_{i1}x_{j1} + x_{i2}x_{j2}) + 1$
$= 1 + x_{i1}^2 x_{j1}^2 + 2\,x_{i1}x_{j1}x_{i2}x_{j2} + x_{i2}^2 x_{j2}^2 + 2\,x_{i1}x_{j1} + 2\,x_{i2}x_{j2} = \phi(\mathbf{x}_i)^{\top}\phi(\mathbf{x}_j)$.
Now we can write the 2nd-degree polynomial transformation function as
$\phi(\mathbf{x}_i) = \big[\,1,\ x_{i1}^2,\ \sqrt{2}\,x_{i1}x_{i2},\ x_{i2}^2,\ \sqrt{2}\,x_{i1},\ \sqrt{2}\,x_{i2}\,\big]^{\top}$.
The weight vector in the feature space is then
$\mathbf{w} = \sum_{i=1}^{4}\alpha_i y_i\,\phi(\mathbf{x}_i) = \tfrac{1}{8}\big[-\phi(\mathbf{x}_1) + \phi(\mathbf{x}_2) + \phi(\mathbf{x}_3) - \phi(\mathbf{x}_4)\big]$
$= \tfrac{1}{8}\begin{bmatrix} 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1\\ \sqrt{2} & -\sqrt{2} & -\sqrt{2} & \sqrt{2}\\ 1 & 1 & 1 & 1\\ -\sqrt{2} & -\sqrt{2} & \sqrt{2} & \sqrt{2}\\ -\sqrt{2} & \sqrt{2} & -\sqrt{2} & \sqrt{2} \end{bmatrix}\begin{bmatrix}-1\\ +1\\ +1\\ -1\end{bmatrix} = \big[\,0,\ 0,\ -\tfrac{1}{\sqrt{2}},\ 0,\ 0,\ 0\,\big]^{\top}$.
Therefore, since $\mathbf{w}^{\top}\phi(\mathbf{x}) = -\tfrac{1}{\sqrt{2}}\cdot\sqrt{2}\,x_1 x_2$, the optimal hyperplane function for this XOR problem is
$\hat f(\mathbf{x}) = -x_1 x_2$.
(A short numerical check of this result is sketched after the conclusions below.)

Conclusions
• Research issues:
– How to select a kernel automatically
– How to select optimal parameter values for a kernel
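As a closing numerical check of the XOR example, here is a minimal sketch, again assuming NumPy; the names H, alpha and f_hat are illustrative. It solves the stationarity system (y yᵀ ∘ K) α = 1 directly, a shortcut that is valid here only because the constraints Σ yᵢαᵢ = 0 and αᵢ ≥ 0 happen to hold at the resulting solution, and then confirms that the recovered decision function equals −x₁x₂ on all four training points.

```python
import numpy as np

# XOR data, labels, and the (1 + xi . xj)^2 kernel matrix from the text
X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
y = np.array([-1.0, +1.0, +1.0, -1.0])
K = (1.0 + X @ X.T) ** 2

# Stationarity conditions dQ/dalpha_i = 0  <=>  (y y^T * K) alpha = 1
H = np.outer(y, y) * K
alpha = np.linalg.solve(H, np.ones(4))
print(alpha)                      # [0.125 0.125 0.125 0.125]

def f_hat(x):
    # Decision function f(x) = sum_i alpha_i y_i (1 + xi . x)^2 (the bias is zero here)
    return float(np.sum(alpha * y * (1.0 + X @ x) ** 2))

for xi, yi in zip(X, y):
    print(xi, f_hat(xi), -xi[0] * xi[1], yi)
# f_hat matches -x1*x2 on all four points, so sign(f_hat) reproduces the labels.
```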