Kernel Methods
A B M Shawkat Ali

Data Mining
¤ DM, or KDD (Knowledge Discovery in Databases): extracting previously unknown, valid, and actionable information for crucial decisions.
¤ Approach: Train Data → Model → Test Data → crucial decisions.

History of SVM
• The original optimal hyperplane algorithm proposed by Vladimir Vapnik in 1963 was a linear classifier.
• However, in 1992, Bernhard Boser, Isabelle Guyon and Vapnik suggested a way to create non-linear classifiers by applying the kernel trick (originally proposed by Aizerman et al.) to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a non-linear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be non-linear and the transformed space high-dimensional; thus, although the classifier is a hyperplane in the high-dimensional feature space, it may be non-linear in the original input space.

Property of the SVM
¤ Relatively new approach.
¤ A lot of recent interest, with many successes, e.g. text classification.
¤ Important concepts: transformation into a high-dimensional space; finding a "maximal margin" separation; structural risk minimization rather than empirical risk minimization.

Support Vector Machine (SVM)
¤ Classification: grouping of similar data.
¤ Regression: prediction from historical knowledge.
¤ Novelty detection: detecting abnormal instances in a dataset.
¤ Clustering, feature selection.

SVM Block Diagram (training)
[Block diagram: Training Data Domain → non-linear mapping by kernel → linear feature space of SVM → choose the optimal hyperplane.]

SVM Block Diagram (testing)
[Block diagram: Test Data Domain → kernel mapping → constructed model through feature knowledge → Class I / Class II.]

SVM Formulation (hard margin)
Decision function: $y = \operatorname{sign}(\mathbf{w}\cdot\mathbf{X} + b)$.
The margin is $1/\|\mathbf{w}\|$, with $y_i(\mathbf{w}\cdot\mathbf{X}_i + b) \ge 1$.
Optimisation problem: $\min_{\mathbf{w}}\ \mathbf{w}\cdot\mathbf{w}$ subject to $y_i(\mathbf{w}\cdot\mathbf{X}_i + b) \ge 1,\ \forall i \in S$.
Solution: $\mathbf{w} = \sum_{i\in S}\alpha_i y_i \mathbf{X}_i$, so $y = \operatorname{sign}\big(\sum_{i\in S}\alpha_i y_i\,\mathbf{X}_i\cdot\mathbf{X} + b\big)$.

SVM Formulation (soft margin)
Decision function: $y = \operatorname{sign}(\mathbf{w}\cdot\mathbf{X} + b)$.
Optimisation problem: $\min_{\mathbf{w},b}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \big[1 - y_i(\mathbf{w}\cdot\mathbf{X}_i + b)\big]_+$, which relaxes the hard constraints $y_i(\mathbf{w}\cdot\mathbf{X}_i + b) \ge 1,\ \forall i \in S$.
Solution: $\mathbf{w} = \sum_{i\in S}\alpha_i y_i \mathbf{X}_i$, so $y = \operatorname{sign}\big(\sum_{i\in S}\alpha_i y_i\,\mathbf{X}_i\cdot\mathbf{X} + b\big)$.

SVM Formulation (kernel trick)
Map the inputs with $\Phi(\cdot)$: $\mathbf{X}_i = \phi(\mathbf{x}_i)$, $\mathbf{X} = \phi(\mathbf{x})$, so $\mathbf{X}_i\cdot\mathbf{X} = \phi(\mathbf{x}_i)\cdot\phi(\mathbf{x})$ and
$y = \operatorname{sign}\big(\sum_{i\in S}\alpha_i y_i\,\phi(\mathbf{x}_i)\cdot\phi(\mathbf{x}) + b\big)$.
For a kernel $K(\cdot,\cdot)$ satisfying Mercer's condition, $\phi(\mathbf{x}_i)\cdot\phi(\mathbf{x}) = K(\mathbf{x}_i,\mathbf{x})$, hence
$y = \operatorname{sign}\big(\sum_{i\in S}\alpha_i y_i\,K(\mathbf{x}_i,\mathbf{x}) + b\big)$.

Types of Kernels
Common kernels for SVM:
¤ Linear
¤ Polynomial
¤ Radial Basis Function
New kernels (not used in SVM):
¤ Laplace
¤ Multiquadratic

SVM Kernels
Linear: $K(\mathbf{x}_i,\mathbf{x}) = \mathbf{x}_i\cdot\mathbf{x}$
Polynomial: $K(\mathbf{x}_i,\mathbf{x}) = (\kappa\,\mathbf{x}_i\cdot\mathbf{x})^d$
Gaussian (Radial Basis Function): $K(\mathbf{x}_i,\mathbf{x}) = \exp\!\Big(-\frac{(\mathbf{x}_i\cdot\mathbf{x}_i) - 2(\mathbf{x}_i\cdot\mathbf{x}) + (\mathbf{x}\cdot\mathbf{x})}{2\sigma^2}\Big) = \exp\!\Big(-\frac{\|\mathbf{x}_i-\mathbf{x}\|^2}{2\sigma^2}\Big)$

Laplace Kernel
Introduced by Pavel Paclik et al. in Pattern Recognition Letters 21 (2000). The Laplace kernel is based on the Laplace probability density:
$\hat f(\mathbf{x}\mid\omega) = \frac{1}{N}\sum_{i=1}^{N}\prod_{j=1}^{D}\frac{1}{2h_c}\exp\!\Big(-\frac{|x_j - x_{ij}|}{h_c}\Big)$,
where $h_c$ is the smoothing parameter (Sp).

Linear Kernel
[Figure: data separation with a linear kernel.]

The Reality of Data Separation
[Figure: non-linearly separable data.]

RBF Kernel
[Figure: data separation with an RBF kernel.]

XOR Solved by SVM

Table 5.3. Boolean XOR problem

  Input data x    Output class y
  (−1, −1)        −1
  (−1, +1)        +1
  (+1, −1)        +1
  (+1, +1)        −1

• First, we transform the dataset with the polynomial kernel $K(\mathbf{x}_i,\mathbf{x}_j) = (1 + \mathbf{x}_i^{\top}\mathbf{x}_j)^2$.
Here the data matrix and its transpose are
$\begin{bmatrix}-1 & -1\\ -1 & +1\\ +1 & -1\\ +1 & +1\end{bmatrix}$ and $\begin{bmatrix}-1 & -1 & +1 & +1\\ -1 & +1 & -1 & +1\end{bmatrix}$,
so the kernel matrix is
$K(\mathbf{x}_i,\mathbf{x}_j) = \begin{bmatrix} 9 & 1 & 1 & 1\\ 1 & 9 & 1 & 1\\ 1 & 1 & 9 & 1\\ 1 & 1 & 1 & 9 \end{bmatrix}$.
Following the SVM implementation given in Figure 5.20, the maximisation term is
$Q(\alpha) = \sum_{i=1}^{4}\alpha_i - \frac{1}{2}\sum_{i=1}^{4}\sum_{j=1}^{4}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i,\mathbf{x}_j)$
$= (\alpha_1+\alpha_2+\alpha_3+\alpha_4) - \frac{1}{2}\big(9\alpha_1^2 - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4 + 9\alpha_2^2 + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 + 9\alpha_3^2 - 2\alpha_3\alpha_4 + 9\alpha_4^2\big)$,
subject to:
$\sum_{i=1}^{4} y_i\alpha_i = -\alpha_1 + \alpha_2 + \alpha_3 - \alpha_4 = 0$,
$\alpha_1 \ge 0,\ \alpha_2 \ge 0,\ \alpha_3 \ge 0,\ \alpha_4 \ge 0$.
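The kernel definitions and the XOR kernel matrix above are easy to check numerically. Below is a minimal sketch, assuming NumPy is available; the helper names linear_kernel, polynomial_kernel, rbf_kernel and the parameters kappa, c, degree and sigma are illustrative choices, not from the slides. The offset c is included so the same polynomial form covers the (1 + x_i·x_j)^2 kernel used in this example.

```python
import numpy as np

# XOR training data and labels (Table 5.3)
X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
y = np.array([-1.0, +1.0, +1.0, -1.0])

def linear_kernel(A, B):
    # K(xi, x) = xi . x
    return A @ B.T

def polynomial_kernel(A, B, kappa=1.0, c=1.0, degree=2):
    # K(xi, x) = (kappa * xi . x + c)^degree
    return (kappa * (A @ B.T) + c) ** degree

def rbf_kernel(A, B, sigma=1.0):
    # K(xi, x) = exp(-||xi - x||^2 / (2 sigma^2)), via the expanded squared distance
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                - 2.0 * (A @ B.T)
                + np.sum(B**2, axis=1)[None, :])
    return np.exp(-sq_dists / (2.0 * sigma**2))

# Kernel matrix of the XOR dual problem: K_ij = (1 + xi . xj)^2
K = polynomial_kernel(X, X)
print(K)   # diagonal entries 9, off-diagonal entries 1, as in the text

# Dual objective Q(alpha) = sum(alpha) - 1/2 * alpha^T (y y^T * K) alpha
def dual_objective(alpha):
    H = np.outer(y, y) * K
    return alpha.sum() - 0.5 * alpha @ H @ alpha
```

Maximising this dual_objective subject to the constraints above is exactly what the stationarity system in the next step solves in closed form.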
Setting the partial derivatives $\partial Q/\partial\alpha_i = 0$ gives the linear system
$9\alpha_1 - \alpha_2 - \alpha_3 + \alpha_4 = 1$
$-\alpha_1 + 9\alpha_2 + \alpha_3 - \alpha_4 = 1$
$-\alpha_1 + \alpha_2 + 9\alpha_3 - \alpha_4 = 1$
$\alpha_1 - \alpha_2 - \alpha_3 + 9\alpha_4 = 1$
By solving these equations we can write the solution to this optimisation problem as
$\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = \tfrac{1}{8}$.
Therefore, the decision function in the inner-product representation is
$\hat f(\mathbf{x}) = \sum_{i=1}^{4}\alpha_i y_i K(\mathbf{x}_i,\mathbf{x}) = 0.125\sum_{i=1}^{4} y_i\,(\mathbf{x}_i^{\top}\mathbf{x} + 1)^2$.
The 2nd-degree polynomial kernel function expands as
$K(\mathbf{x}_i,\mathbf{x}_j) = (\langle\mathbf{x}_i,\mathbf{x}_j\rangle + 1)^2 = (x_{i1}x_{j1} + x_{i2}x_{j2})^2 + 2(x_{i1}x_{j1} + x_{i2}x_{j2}) + 1$
$= 1 + x_{i1}^2 x_{j1}^2 + 2\,x_{i1}x_{j1}x_{i2}x_{j2} + x_{i2}^2 x_{j2}^2 + 2\,x_{i1}x_{j1} + 2\,x_{i2}x_{j2} = \phi(\mathbf{x}_i)^{\top}\phi(\mathbf{x}_j)$.
Now we can write the 2nd-degree polynomial transformation function as
$\phi(\mathbf{x}_i) = \big[\,1,\ x_{i1}^2,\ \sqrt{2}\,x_{i1}x_{i2},\ x_{i2}^2,\ \sqrt{2}\,x_{i1},\ \sqrt{2}\,x_{i2}\,\big]^{\top}$.
The weight vector in the feature space is then
$\mathbf{w} = \sum_{i=1}^{4}\alpha_i y_i\,\phi(\mathbf{x}_i) = \tfrac{1}{8}\big[-\phi(\mathbf{x}_1) + \phi(\mathbf{x}_2) + \phi(\mathbf{x}_3) - \phi(\mathbf{x}_4)\big]$
$= \tfrac{1}{8}\begin{bmatrix} 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1\\ \sqrt{2} & -\sqrt{2} & -\sqrt{2} & \sqrt{2}\\ 1 & 1 & 1 & 1\\ -\sqrt{2} & -\sqrt{2} & \sqrt{2} & \sqrt{2}\\ -\sqrt{2} & \sqrt{2} & -\sqrt{2} & \sqrt{2} \end{bmatrix}\begin{bmatrix}-1\\ +1\\ +1\\ -1\end{bmatrix} = \big[\,0,\ 0,\ -\tfrac{1}{\sqrt{2}},\ 0,\ 0,\ 0\,\big]^{\top}$.
Therefore, since $\mathbf{w}^{\top}\phi(\mathbf{x}) = -\tfrac{1}{\sqrt{2}}\cdot\sqrt{2}\,x_1 x_2$, the optimal hyperplane function for this XOR problem is
$\hat f(\mathbf{x}) = -x_1 x_2$.
(A short numerical check of this result is sketched after the conclusions below.)

Conclusions
• Research issues:
– How to select a kernel automatically
– How to select optimal parameter values for a kernel
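As a closing numerical check of the XOR example, here is a minimal sketch, again assuming NumPy; the names H, alpha and f_hat are illustrative. It solves the stationarity system (y yᵀ ∘ K) α = 1 directly, a shortcut that is valid here only because the constraints Σ yᵢαᵢ = 0 and αᵢ ≥ 0 happen to hold at the resulting solution, and then confirms that the recovered decision function equals −x₁x₂ on all four training points.

```python
import numpy as np

# XOR data, labels, and the (1 + xi . xj)^2 kernel matrix from the text
X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
y = np.array([-1.0, +1.0, +1.0, -1.0])
K = (1.0 + X @ X.T) ** 2

# Stationarity conditions dQ/dalpha_i = 0  <=>  (y y^T * K) alpha = 1
H = np.outer(y, y) * K
alpha = np.linalg.solve(H, np.ones(4))
print(alpha)                      # [0.125 0.125 0.125 0.125]

def f_hat(x):
    # Decision function f(x) = sum_i alpha_i y_i (1 + xi . x)^2 (the bias is zero here)
    return float(np.sum(alpha * y * (1.0 + X @ x) ** 2))

for xi, yi in zip(X, y):
    print(xi, f_hat(xi), -xi[0] * xi[1], yi)
# f_hat matches -x1*x2 on all four points, so sign(f_hat) reproduces the labels.
```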