Lecture 7: Kernels for Classification and Regression
CS 194-10, Fall 2011

Laurent El Ghaoui
EECS Department, UC Berkeley

September 15, 2011
Outline

Motivations
  - Linear classification and regression
  - Examples
  - Generic form

The kernel trick
  - Linear case
  - Nonlinear case

Examples
  - Polynomial kernels
  - Other kernels
  - Kernels in practice
A linear regression problem

Linear auto-regressive model for a time series: y_t is a linear function of y_{t-1}, y_{t-2}:

    y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2},   t = 1, ..., T.

This writes y_t = w^T x_t, with x_t the "feature vectors"

    x_t := (1, y_{t-1}, y_{t-2}),   t = 1, ..., T.

Model fitting via least-squares:

    min_w ‖X^T w − y‖_2^2

Prediction rule: ŷ_{T+1} = w_1 + w_2 y_T + w_3 y_{T-1} = w^T x_{T+1}.
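A minimal numpy sketch of this fit (the function name and the row-stacked design matrix are illustrative choices, not from the slides):

```python
import numpy as np

def fit_ar2(y):
    """Fit y_t = w1 + w2*y_{t-1} + w3*y_{t-2} by least squares.

    Rows of the design matrix are the feature vectors
    x_t = (1, y_{t-1}, y_{t-2}); the matrix plays the role of X^T.
    """
    D = np.column_stack([np.ones(len(y) - 2), y[1:-1], y[:-2]])
    w, *_ = np.linalg.lstsq(D, y[2:], rcond=None)
    return w

# One-step-ahead prediction: yhat_{T+1} = w^T (1, y_T, y_{T-1}), e.g.
# w = fit_ar2(y); yhat = w @ np.array([1.0, y[-1], y[-2]])
```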
Nonlinear regression

Nonlinear auto-regressive model for a time series: y_t is a quadratic function of y_{t-1}, y_{t-2}:

    y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2} + w_4 y_{t-1}^2 + w_5 y_{t-1} y_{t-2} + w_6 y_{t-2}^2.

This writes y_t = w^T φ(x_t), with φ(x_t) the augmented feature vectors

    φ(x_t) := (1, y_{t-1}, y_{t-2}, y_{t-1}^2, y_{t-1} y_{t-2}, y_{t-2}^2).

Everything is the same as before, with x replaced by φ(x).
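Concretely, the only change to the sketch above is the feature map (phi is an illustrative name):

```python
import numpy as np

def phi(y1, y2):
    """Augmented feature vector phi(x_t) for the quadratic AR model,
    with y1 = y_{t-1} and y2 = y_{t-2}."""
    return np.array([1.0, y1, y2, y1**2, y1 * y2, y2**2])
```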
Nonlinear classification

Non-linear (e.g., quadratic) decision boundary:

    w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2 + b = 0.

This writes w^T φ(x) + b = 0, with φ(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2).
Challenges

In principle, it seems we can always augment the dimension of the feature space to make the data linearly separable. (See the video at http://www.youtube.com/watch?v=3liCbRZPrZA.)

How do we do it in a computationally efficient manner?
Linear least-squares

    min_w ‖X^T w − y‖_2^2 + λ‖w‖_2^2

where
  - X = [x_1, ..., x_m] is the n × m matrix of data points,
  - y ∈ R^m is the "response" vector,
  - w contains the regression coefficients,
  - λ ≥ 0 is a regularization parameter.

Prediction rule: y = w^T x, where x ∈ R^n is a new data point.
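For the squared loss, the regularized problem has a closed form via the normal equations (X X^T + λI) w = X y; a minimal sketch, assuming X has data points as columns as in the slides:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve min_w ||X^T w - y||^2 + lam ||w||^2 in closed form.

    X: (n, m) array, data points as columns.
    Normal equations: (X X^T + lam I) w = X y.
    """
    n = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(n), X @ y)
```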
Support vector machine (SVM)

    min_{w,b} Σ_{i=1}^m max(0, 1 − y_i (w^T x_i + b)) + λ‖w‖_2^2

where
  - X = [x_1, ..., x_m] is the n × m matrix of data points in R^n,
  - y ∈ {−1, 1}^m is the label vector,
  - w, b contain the classifier coefficients,
  - λ ≥ 0 is a regularization parameter.

In the sequel, we'll ignore the bias term b (for simplicity only).

Classification rule: y = sign(w^T x + b), where x ∈ R^n is a new data point.
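The hinge objective is convex but non-smooth; a minimal subgradient-descent sketch (bias ignored, as in the slides; the step size and iteration count are illustrative):

```python
import numpy as np

def svm_subgradient(X, y, lam, steps=1000, lr=0.01):
    """Minimize sum_i max(0, 1 - y_i w^T x_i) + lam ||w||^2.

    X: (n, m) array, data points as columns; y: labels in {-1, +1}.
    """
    n, m = X.shape
    w = np.zeros(n)
    for _ in range(steps):
        active = y * (X.T @ w) < 1.0                   # nonzero hinge loss
        g = -X[:, active] @ y[active] + 2.0 * lam * w  # a subgradient
        w -= lr * g
    return w
```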
Generic form of problem

Many classification and regression problems can be written

    min_w L(X^T w, y) + λ‖w‖_2^2

where
  - X = [x_1, ..., x_m] is the n × m matrix of data points,
  - y ∈ R^m contains the response vector (or the labels),
  - w contains the classifier coefficients,
  - L is a "loss" function that depends on the problem considered,
  - λ ≥ 0 is a regularization parameter.

The prediction/classification rule depends only on w^T x, where x ∈ R^n is a new data point.
Loss functions

  - Squared loss (for linear least-squares regression):

        L(z, y) = ‖z − y‖_2^2.

  - Hinge loss (for SVMs):

        L(z, y) = Σ_{i=1}^m max(0, 1 − y_i z_i).

  - Logistic loss (for logistic regression):

        L(z, y) = Σ_{i=1}^m log(1 + e^{−y_i z_i}).
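A minimal numpy sketch of the three losses (function names are illustrative; np.logaddexp gives a numerically stable log(1 + e^a)):

```python
import numpy as np

def squared_loss(z, y):
    return np.sum((z - y) ** 2)

def hinge_loss(z, y):
    return np.sum(np.maximum(0.0, 1.0 - y * z))

def logistic_loss(z, y):
    # logaddexp(0, a) = log(e^0 + e^a) = log(1 + e^a)
    return np.sum(np.logaddexp(0.0, -y * z))
```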
Key result

For the generic problem

    min_w L(X^T w) + λ‖w‖_2^2

the optimal w lies in the span of the data points x_1, ..., x_m:

    w = X v

for some vector v ∈ R^m.
Proof

Any w ∈ R^n can be written as the sum of two orthogonal vectors:

    w = X v + r

where X^T r = 0 (that is, r lies in the nullspace N(X^T), the orthogonal complement of the range R(X)).

Then X^T w = X^T X v, so the loss term does not depend on r, while by orthogonality ‖w‖_2^2 = ‖Xv‖_2^2 + ‖r‖_2^2. Setting r = 0 can only decrease the objective, so an optimal w has the form w = Xv.

[Figure: illustration of the fundamental theorem of linear algebra, R^m = R(A) ⊕ N(A^T), in R^3; the figure shows the case X = A = (a_1, a_2).]
Consequence of key result

For the generic problem

    min_w L(X^T w) + λ‖w‖_2^2

the optimal w can be written as w = Xv for some vector v ∈ R^m.

Hence the training problem depends only on K := X^T X:

    min_v L(Kv) + λ v^T K v.
Kernel matrix

The training problem depends only on the "kernel matrix" K = X^T X:

    K_ij = x_i^T x_j.

K contains the scalar products between all pairs of data points.

The prediction/classification rule also depends only on scalar products, between the new point x and the data points x_1, ..., x_m:

    w^T x = v^T X^T x = v^T k,   k := X^T x = (x^T x_1, ..., x^T x_m).
Computational advantages

Once K is formed (this takes O(m^2 n) operations), the training problem has only m variables.

When n ≫ m, this leads to a dramatic reduction in problem size.
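For the squared loss, the kernelized problem min_v ‖Kv − y‖_2^2 + λ v^T K v has the stationary point v = (K + λI)^{−1} y; a minimal sketch (function names are illustrative):

```python
import numpy as np

def kernel_ridge_fit(K, y, lam):
    """Solve the kernelized least-squares problem: (K + lam I) v = y
    makes the gradient 2 K ((K + lam I) v - y) vanish."""
    m = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(m), y)

def kernel_predict(v, k):
    """Prediction rule w^T x = v^T k, k = (x^T x_1, ..., x^T x_m)."""
    return v @ k
```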
How about the nonlinear case?

In the nonlinear case, we simply replace the feature vectors x_i by "augmented" feature vectors φ(x_i), with φ a non-linear mapping.

Example: in classification with a quadratic decision boundary, we use

    φ(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2).

This leads to the modified kernel matrix

    K_ij = φ(x_i)^T φ(x_j),   1 ≤ i, j ≤ m.
The kernel function

The kernel function associated with the mapping φ is

    k(x, z) = φ(x)^T φ(z).

It provides information about the metric in the feature space, e.g.:

    ‖φ(x) − φ(z)‖_2^2 = k(x, x) − 2 k(x, z) + k(z, z).

The computational effort involved in
  - solving the training problem, and
  - making a prediction
depends only on our ability to quickly evaluate such scalar products.

We can't choose k arbitrarily; it has to satisfy the above for some φ.
Quadratic kernels

Classification with quadratic boundaries involves feature vectors

    φ(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2).

Fact: given two vectors x, z ∈ R^2, we have

    φ(x)^T φ(z) = (1 + x^T z)^2,

provided the components x_1, x_2, and x_1 x_2 of φ are scaled by √2 (the scaling does not change the class of decision boundaries that can be represented).
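A quick numerical check of this identity (phi2 is an illustrative name; note the √2 scalings):

```python
import numpy as np

def phi2(x):
    """Degree-2 feature map on R^2, scaled so that
    phi2(x)^T phi2(z) == (1 + x^T z)^2 exactly."""
    s = np.sqrt(2.0)
    return np.array([1.0, s * x[0], s * x[1],
                     x[0]**2, s * x[0] * x[1], x[1]**2])

x, z = np.array([0.3, -1.2]), np.array([2.0, 0.5])
assert np.isclose(phi2(x) @ phi2(z), (1.0 + x @ z) ** 2)
```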
Polynomial kernels

More generally, when φ(x) is the vector formed with all the products between the components of x ∈ R^n, up to degree d, then for any two vectors x, z ∈ R^n,

    φ(x)^T φ(z) = (1 + x^T z)^d.

Evaluated this way, the computational effort grows only linearly in n.

This represents a dramatic speedup over the "brute force" approach:
  - form φ(x), φ(z);
  - evaluate φ(x)^T φ(z),
whose computational effort grows as n^d.
Other kernels

Gaussian kernel function:

    k(x, z) = exp(−‖x − z‖_2^2 / (2σ^2)),

where σ > 0 is a scale parameter. It allows us to ignore points that are too far apart, and corresponds to a non-linear mapping φ to an infinite-dimensional feature space.

There is a large variety (a zoo?) of other kernels, some adapted to the structure of the data (text, images, etc.).
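A minimal sketch of forming the Gaussian kernel matrix (the function name and column-wise data layout follow the slides' convention, not a library API):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    X: (n, m) array, data points as columns."""
    sq = np.sum(X ** 2, axis=0)                       # squared norms ||x_i||^2
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)  # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))
```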
In practice

  - Kernels need to be chosen by the user.
  - The choice is not always obvious; Gaussian and polynomial kernels are popular.
  - Control over-fitting via cross-validation (with respect to, say, the scale parameter of the Gaussian kernel, or the degree of the polynomial kernel).
  - Kernel methods are not well adapted to ℓ1-norm regularization.
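As one possible workflow (a sketch assuming scikit-learn is available; in its parameterization gamma = 1/(2σ^2), and rows, not columns, are data points):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Cross-validate the Gaussian-kernel scale (gamma) and the amount of
# regularization (C) on a 5-fold split; the grid values are illustrative.
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"gamma": [0.01, 0.1, 1.0], "C": [0.1, 1.0, 10.0]},
    cv=5,
)
# search.fit(X_train, y_train)  # X_train: (m, n); y_train: labels in {-1, +1}
```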