Kernel Methods
A B M Shawkat Ali
Data Mining
¤ DM or KDD (Knowledge Discovery in Databases)
  Extracting previously unknown, valid, and actionable information → crucial decisions
¤ Approach
  [Diagram: Train Data → Model; the model is then applied to Test Data → crucial decisions]
History of SVM
• The original optimal hyperplane algorithm proposed by Vladimir
Vapnik in 1963 was a linear classifier.
• However, in 1992, Bernhard Boser, Isabelle Guyon and Vapnik
suggested a way to create non-linear classifiers by applying the
kernel trick (originally proposed by Aizerman et al.) to maximum-margin hyperplanes. The resulting algorithm is formally similar,
except that every dot product is replaced by a non-linear kernel
function. This allows the algorithm to fit the maximum-margin
hyperplane in a transformed feature space. The transformation may
be non-linear and the transformed space high dimensional; thus
though the classifier is a hyperplane in the high-dimensional feature
space, it may be non-linear in the original input space.
Property of the SVM
¤ Relatively new approach
¤ A lot of interest recently:
  – Many successes, e.g., text classification
¤ Important concepts:
  – Transformation into a high-dimensional space
  – Finding a "maximal margin" separation
  – Structural risk minimization rather than empirical risk minimization
Support Vector Machine (SVM)
¤ Classification
  – Grouping of similar data.
¤ Regression
  – Prediction from historical knowledge.
¤ Novelty Detection
  – Detecting abnormal instances in a dataset.
¤ Clustering, Feature Selection
SVM Block Diagram
[Figure: the training data domain is mapped by a non-linear kernel into the linear feature space of the SVM]

SVM Block Diagram
[Figure: the test data domain is passed through the kernel mapping into the model constructed from feature knowledge, where the optimal hyperplane is chosen to separate Class I from Class II]
SVM Formulation

$$y = \mathrm{sign}(\mathbf{w} \cdot \mathbf{X} + b)$$

Maximise the margin $\frac{1}{\|\mathbf{w}\|}$ subject to the constraints $y_i(\mathbf{w} \cdot \mathbf{X}_i + b) \ge 1$, i.e.

$$\min_{\mathbf{w}} \; \|\mathbf{w}\| \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{X}_i + b) \ge 1, \; i \in S$$

The weight vector is a linear combination of the support vectors:

$$\mathbf{w} = \sum_{i \in S} \alpha_i y_i \mathbf{X}_i
\qquad\Rightarrow\qquad
y = \mathrm{sign}\Big(\sum_{i \in S} \alpha_i y_i \, \mathbf{X}_i \cdot \mathbf{X} + b\Big)$$
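A minimal NumPy sketch of this dual-form decision rule, assuming the multipliers $\alpha_i$, labels $y_i$, support vectors $\mathbf{X}_i$ and bias $b$ have already been obtained from training (the values below are illustrative only, not taken from the slides):

```python
import numpy as np

def svm_decision(X_sv, y_sv, alpha, b, x):
    """Dual-form prediction: y = sign(sum_i alpha_i * y_i * (X_i . x) + b)."""
    scores = alpha * y_sv * (X_sv @ x)   # one weighted dot product per support vector
    return np.sign(scores.sum() + b)

# Illustrative support vectors, labels, multipliers and bias.
X_sv = np.array([[1.0, 2.0], [-1.0, -1.5]])
y_sv = np.array([+1.0, -1.0])
alpha = np.array([0.5, 0.5])
b = 0.0
print(svm_decision(X_sv, y_sv, alpha, b, np.array([0.5, 1.0])))  # -> 1.0
```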
SVM Formulation

$$y = \mathrm{sign}(\mathbf{w} \cdot \mathbf{X} + b)$$

$$\min_{\mathbf{w},\,b} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \big[\,1 - y_i(\mathbf{w} \cdot \mathbf{X}_i + b)\,\big]_+$$

with $y_i(\mathbf{w} \cdot \mathbf{X}_i + b) \ge 1$ for $i \in S$, and as before

$$\mathbf{w} = \sum_{i \in S} \alpha_i y_i \mathbf{X}_i
\qquad\Rightarrow\qquad
y = \mathrm{sign}\Big(\sum_{i \in S} \alpha_i y_i \, \mathbf{X}_i \cdot \mathbf{X} + b\Big)$$
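A small NumPy sketch of this soft-margin (hinge-loss) objective; the dataset, weight vector and value of $C$ below are assumptions made purely for illustration:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum_i max(0, 1 - y_i * (w . X_i + b))."""
    margins = y * (X @ w + b)                # y_i (w . X_i + b) for every training point
    hinge = np.maximum(0.0, 1.0 - margins)   # zero loss once a point clears the margin
    return 0.5 * (w @ w) + C * hinge.sum()

X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-1.0, -2.5]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
print(soft_margin_objective(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))  # -> 0.25
```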
SVM Formulation

Map the inputs with a non-linear transformation $\Phi(\cdot)$:

$$\mathbf{X}_i = \Phi(\mathbf{x}_i), \qquad \mathbf{X} = \Phi(\mathbf{x}),
\qquad \mathbf{X}_i \cdot \mathbf{X} = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x})$$

so the decision function becomes

$$y = \mathrm{sign}\Big(\sum_{i \in S} \alpha_i y_i \, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}) + b\Big)$$

Choosing a kernel $K(\cdot,\cdot)$ that satisfies Mercer's condition, with $\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}) = K(\mathbf{x}_i, \mathbf{x})$, gives

$$y = \mathrm{sign}\Big(\sum_{i \in S} \alpha_i y_i \, K(\mathbf{x}_i, \mathbf{x}) + b\Big)$$
Types of Kernels
Common kernels for SVM
¤ Linear
¤ Polynomial
¤ Radial Basis Function
New kernels (not used in SVM)
¤ Laplace
¤ Multiquadratic
SVM kernels

Linear:
$$K(\mathbf{x}_i, \mathbf{x}) = \mathbf{x}_i \cdot \mathbf{x}$$

Polynomial:
$$K(\mathbf{x}_i, \mathbf{x}) = (k + \mathbf{x}_i \cdot \mathbf{x})^d$$

Gaussian (Radial Basis Function):
$$K(\mathbf{x}_i, \mathbf{x}) = \exp\!\left(-\frac{k(\mathbf{x}_i \cdot \mathbf{x}_i) + k(\mathbf{x} \cdot \mathbf{x}) - 2\,k(\mathbf{x}_i \cdot \mathbf{x})}{2\sigma^2}\right)$$

(with the linear kernel $k$ this reduces to $\exp\!\big(-\|\mathbf{x}_i - \mathbf{x}\|^2 / 2\sigma^2\big)$).
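A minimal NumPy sketch of these three kernels; the parameter names $k$, $d$ and $\sigma$ follow the slide, while the default values are assumptions:

```python
import numpy as np

def linear_kernel(xi, x):
    return xi @ x

def polynomial_kernel(xi, x, k=1.0, d=2):
    return (k + xi @ x) ** d

def rbf_kernel(xi, x, sigma=1.0):
    # exp(-(xi.xi + x.x - 2*xi.x) / (2*sigma^2)) == exp(-||xi - x||^2 / (2*sigma^2))
    sq_dist = xi @ xi + x @ x - 2.0 * (xi @ x)
    return np.exp(-sq_dist / (2.0 * sigma**2))

a, b = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(a, b), polynomial_kernel(a, b), rbf_kernel(a, b))
```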
Laplace kernel
Introduced by Pavel Paclik et al. in Pattern Recognition Letters 21 (2000).
The Laplace kernel is based on the Laplace probability density:

$$\hat{f}(\mathbf{x} \mid \theta) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{D} \frac{1}{2 h_c} \exp\!\left(-\frac{|x_j - x_{ij}|}{h_c}\right)$$

where $h_c$ is the smoothing parameter (Sp).
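A small NumPy sketch of this Laplace kernel density estimate; the data, query point and smoothing value $h_c$ are assumptions for illustration:

```python
import numpy as np

def laplace_density(x, X_train, h_c):
    """f_hat(x) = (1/N) * sum_i prod_j (1 / (2*h_c)) * exp(-|x_j - x_ij| / h_c)."""
    per_dim = np.exp(-np.abs(x - X_train) / h_c) / (2.0 * h_c)  # shape (N, D)
    return per_dim.prod(axis=1).mean()                          # product over D, average over N

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, -0.5]])
print(laplace_density(np.array([0.2, 0.1]), X_train, h_c=0.5))
```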
Linear Kernel
[Figure]

The reality of data separation
[Figure]

RBF kernel
[Figure]
XOR solved by SVM
Table 5.3. Boolean XOR Problem

  Input data x      Output class y
  (-1, -1)          -1
  (-1, +1)          +1
  (+1, -1)          +1
  (+1, +1)          -1
• First, we transform the dataset with the 2nd-degree polynomial kernel:

$$K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T \mathbf{x}_j)^2$$

Here,

$$\mathbf{x}_i \cdot \mathbf{x}_j^T =
\begin{bmatrix} -1 & -1 \\ -1 & +1 \\ +1 & -1 \\ +1 & +1 \end{bmatrix}
\begin{bmatrix} -1 & -1 & +1 & +1 \\ -1 & +1 & -1 & +1 \end{bmatrix}
= \begin{bmatrix} 2 & 0 & 0 & -2 \\ 0 & 2 & -2 & 0 \\ 0 & -2 & 2 & 0 \\ -2 & 0 & 0 & 2 \end{bmatrix}$$

Therefore the kernel matrix is:

$$K(\mathbf{x}_i, \mathbf{x}_j) =
\begin{bmatrix} 9 & 1 & 1 & 1 \\ 1 & 9 & 1 & 1 \\ 1 & 1 & 9 & 1 \\ 1 & 1 & 1 & 9 \end{bmatrix}$$
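This kernel matrix is easy to reproduce numerically; a quick NumPy check:

```python
import numpy as np

# The four XOR inputs from Table 5.3.
X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)

# Gram matrix of dot products, then the 2nd-degree polynomial kernel (1 + xi . xj)^2.
K = (1.0 + X @ X.T) ** 2
print(K)  # [[9. 1. 1. 1.] [1. 9. 1. 1.] [1. 1. 9. 1.] [1. 1. 1. 9.]]
```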
We can write the maximization term, following the SVM implementation given in Figure 5.20, as:

$$\max_{\alpha} \; \sum_{i=1}^{4} \alpha_i - \frac{1}{2} \sum_{i=1}^{4} \sum_{j=1}^{4} \alpha_i \alpha_j \, y_i y_j \, K(\mathbf{x}_i, \mathbf{x}_j)$$

$$= \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \frac{1}{2}\big(9\alpha_1^2 - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4 + 9\alpha_2^2 + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 + 9\alpha_3^2 - 2\alpha_3\alpha_4 + 9\alpha_4^2\big)$$

subject to:

$$\sum_{i=1}^{4} y_i \alpha_i = -\alpha_1 + \alpha_2 + \alpha_3 - \alpha_4 = 0$$

$$0 \le \alpha_1, \quad 0 \le \alpha_2, \quad 0 \le \alpha_3, \quad 0 \le \alpha_4.$$

Differentiating the objective with respect to each $\alpha_i$ and setting the result to zero gives:

$$\begin{aligned}
9\alpha_1 - \alpha_2 - \alpha_3 + \alpha_4 &= 1 \\
-\alpha_1 + 9\alpha_2 + \alpha_3 - \alpha_4 &= 1 \\
-\alpha_1 + \alpha_2 + 9\alpha_3 - \alpha_4 &= 1 \\
\alpha_1 - \alpha_2 - \alpha_3 + 9\alpha_4 &= 1
\end{aligned}$$

By solving these equations we can write the solution to this optimisation problem as:

$$\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = \frac{1}{8}$$
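The four stationarity equations can be verified numerically; a short NumPy sketch:

```python
import numpy as np

# Coefficients y_i * y_j * K(x_i, x_j) of the four derivative equations, and the right-hand side.
A = np.array([[ 9, -1, -1,  1],
              [-1,  9,  1, -1],
              [-1,  1,  9, -1],
              [ 1, -1, -1,  9]], dtype=float)
b = np.ones(4)

alpha = np.linalg.solve(A, b)
print(alpha)                             # [0.125 0.125 0.125 0.125]
print(np.array([-1, 1, 1, -1]) @ alpha)  # equality constraint sum_i y_i alpha_i = 0
```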
Therefore, the decision function in the inner-product representation is:

$$\hat{f}(\mathbf{x}) = \sum_{i=1}^{4} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) = 0.125 \sum_{i=1}^{4} y_i \big(\mathbf{x}_i \cdot \mathbf{x} + 1\big)^2$$

The 2nd-degree polynomial kernel function:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \big((\mathbf{x}_i \cdot \mathbf{x}_j) + 1\big)^2
= (x_{i1}x_{j1} + x_{i2}x_{j2})^2 + 2(x_{i1}x_{j1} + x_{i2}x_{j2}) + 1$$

$$= 1 + (x_{i1}x_{j1})^2 + 2(x_{i1}x_{j1})(x_{i2}x_{j2}) + (x_{i2}x_{j2})^2 + 2(x_{i1}x_{j1}) + 2(x_{i2}x_{j2})
= \Phi(\mathbf{x}_i)^T \Phi(\mathbf{x}_j)$$

Now we can write the 2nd-degree polynomial transformation function as:

$$\Phi(\mathbf{x}_i) = \big[\,1,\; x_{i1}^2,\; \sqrt{2}\,x_{i1}x_{i2},\; x_{i2}^2,\; \sqrt{2}\,x_{i1},\; \sqrt{2}\,x_{i2}\,\big]^T$$
The weight vector in feature space is:

$$\mathbf{w}^o = \sum_{i=1}^{4} \alpha_i y_i \Phi(\mathbf{x}_i)
= \frac{1}{8}\big[-\Phi(\mathbf{x}_1) + \Phi(\mathbf{x}_2) + \Phi(\mathbf{x}_3) - \Phi(\mathbf{x}_4)\big]$$

$$= \frac{1}{8}\left(
-\begin{bmatrix} 1 \\ 1 \\ \sqrt{2} \\ 1 \\ -\sqrt{2} \\ -\sqrt{2} \end{bmatrix}
+\begin{bmatrix} 1 \\ 1 \\ -\sqrt{2} \\ 1 \\ -\sqrt{2} \\ \sqrt{2} \end{bmatrix}
+\begin{bmatrix} 1 \\ 1 \\ -\sqrt{2} \\ 1 \\ \sqrt{2} \\ -\sqrt{2} \end{bmatrix}
-\begin{bmatrix} 1 \\ 1 \\ \sqrt{2} \\ 1 \\ \sqrt{2} \\ \sqrt{2} \end{bmatrix}
\right)
= \begin{bmatrix} 0 \\ 0 \\ -1/\sqrt{2} \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

so that

$$\hat{f}(\mathbf{x}) = \mathbf{w}^{o\,T} \Phi(\mathbf{x})
= \big[\,0,\; 0,\; -\tfrac{1}{\sqrt{2}},\; 0,\; 0,\; 0\,\big]
\begin{bmatrix} 1 \\ x_1^2 \\ \sqrt{2}\,x_1 x_2 \\ x_2^2 \\ \sqrt{2}\,x_1 \\ \sqrt{2}\,x_2 \end{bmatrix}$$

Therefore the optimal hyperplane function for this XOR problem is:

$$\hat{f}(\mathbf{x}) = -x_1 x_2$$
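A short NumPy check that the kernel-form decision function above reproduces $-x_1 x_2$ and hence the XOR labels:

```python
import numpy as np

X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
y = np.array([-1.0, +1.0, +1.0, -1.0])
alpha = np.full(4, 0.125)

def f_hat(x):
    """Kernel form of the XOR decision function: sum_i alpha_i * y_i * (1 + x_i . x)^2."""
    return np.sum(alpha * y * (1.0 + X @ x) ** 2)

for xi, yi in zip(X, y):
    print(xi, f_hat(xi), -xi[0] * xi[1], yi)  # f_hat equals -x1*x2 and matches each label
```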
Conclusions
• Research Issues
– How to select a kernel automatically
– How to select optimal parameter values for a kernel