Lecture 7: Kernels for Classification and Regression
CS 194-10, Fall 2011

Laurent El Ghaoui
EECS Department, UC Berkeley

September 15, 2011
Outline

Motivations
  - Linear classification and regression
  - Examples
  - Generic form

The kernel trick
  - Linear case
  - Nonlinear case

Examples
  - Polynomial kernels
  - Other kernels
  - Kernels in practice
A linear regression problem

Linear auto-regressive model for a time series: y_t is a linear function of y_{t-1}, y_{t-2}:

    y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2},   t = 1, ..., T.

This writes y_t = w^T x_t, with x_t the "feature vectors"

    x_t := (1, y_{t-1}, y_{t-2}),   t = 1, ..., T.

Model fitting via least-squares:

    min_w ‖X^T w − y‖_2^2

Prediction rule: ŷ_{T+1} = w_1 + w_2 y_T + w_3 y_{T-1} = w^T x_{T+1}.
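A minimal numpy sketch of this fit (the function name and the row-stacked design matrix are illustrative choices, not from the slides):

```python
import numpy as np

def fit_ar2(y):
    """Fit y_t = w1 + w2*y_{t-1} + w3*y_{t-2} by least squares.

    Rows of the design matrix are the feature vectors
    x_t = (1, y_{t-1}, y_{t-2}); the matrix plays the role of X^T.
    """
    D = np.column_stack([np.ones(len(y) - 2), y[1:-1], y[:-2]])
    w, *_ = np.linalg.lstsq(D, y[2:], rcond=None)
    return w

# One-step-ahead prediction: yhat_{T+1} = w^T (1, y_T, y_{T-1}), e.g.
# w = fit_ar2(y); yhat = w @ np.array([1.0, y[-1], y[-2]])
```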
Nonlinear regression

Nonlinear auto-regressive model for a time series: y_t is a quadratic function of y_{t-1}, y_{t-2}:

    y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2} + w_4 y_{t-1}^2 + w_5 y_{t-1} y_{t-2} + w_6 y_{t-2}^2.

This writes y_t = w^T φ(x_t), with φ(x_t) the augmented feature vectors

    φ(x_t) := (1, y_{t-1}, y_{t-2}, y_{t-1}^2, y_{t-1} y_{t-2}, y_{t-2}^2).

Everything is the same as before, with x replaced by φ(x).
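Concretely, the only change to the sketch above is the feature map (phi is an illustrative name):

```python
import numpy as np

def phi(y1, y2):
    """Augmented feature vector phi(x_t) for the quadratic AR model,
    with y1 = y_{t-1} and y2 = y_{t-2}."""
    return np.array([1.0, y1, y2, y1**2, y1 * y2, y2**2])
```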
Nonlinear classification

Non-linear (e.g., quadratic) decision boundary:

    w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2 + b = 0.

This writes w^T φ(x) + b = 0, with φ(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2).
Challenges

In principle, it seems we can always augment the dimension of the feature space to make the data linearly separable. (See the video at http://www.youtube.com/watch?v=3liCbRZPrZA.)

How do we do it in a computationally efficient manner?
Linear least-squares

    min_w ‖X^T w − y‖_2^2 + λ‖w‖_2^2

where
  - X = [x_1, ..., x_m] is the n × m matrix of data points,
  - y ∈ R^m is the "response" vector,
  - w contains the regression coefficients,
  - λ ≥ 0 is a regularization parameter.

Prediction rule: y = w^T x, where x ∈ R^n is a new data point.
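For the squared loss, the regularized problem has a closed form via the normal equations (X X^T + λI) w = X y; a minimal sketch, assuming X has data points as columns as in the slides:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve min_w ||X^T w - y||^2 + lam ||w||^2 in closed form.

    X: (n, m) array, data points as columns.
    Normal equations: (X X^T + lam I) w = X y.
    """
    n = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(n), X @ y)
```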
Support vector machine (SVM)

    min_{w,b} Σ_{i=1}^m max(0, 1 − y_i (w^T x_i + b)) + λ‖w‖_2^2

where
  - X = [x_1, ..., x_m] is the n × m matrix of data points in R^n,
  - y ∈ {−1, 1}^m is the label vector,
  - w, b contain the classifier coefficients,
  - λ ≥ 0 is a regularization parameter.

In the sequel, we'll ignore the bias term b (for simplicity only).

Classification rule: y = sign(w^T x + b), where x ∈ R^n is a new data point.
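The hinge objective is convex but non-smooth; a minimal subgradient-descent sketch (bias ignored, as in the slides; the step size and iteration count are illustrative):

```python
import numpy as np

def svm_subgradient(X, y, lam, steps=1000, lr=0.01):
    """Minimize sum_i max(0, 1 - y_i w^T x_i) + lam ||w||^2.

    X: (n, m) array, data points as columns; y: labels in {-1, +1}.
    """
    n, m = X.shape
    w = np.zeros(n)
    for _ in range(steps):
        active = y * (X.T @ w) < 1.0                   # nonzero hinge loss
        g = -X[:, active] @ y[active] + 2.0 * lam * w  # a subgradient
        w -= lr * g
    return w
```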
Generic form of problem

Many classification and regression problems can be written

    min_w L(X^T w, y) + λ‖w‖_2^2

where
  - X = [x_1, ..., x_m] is the n × m matrix of data points,
  - y ∈ R^m contains the response vector (or the labels),
  - w contains the classifier coefficients,
  - L is a "loss" function that depends on the problem considered,
  - λ ≥ 0 is a regularization parameter.

The prediction/classification rule depends only on w^T x, where x ∈ R^n is a new data point.
Loss functions

  - Squared loss (for linear least-squares regression):

        L(z, y) = ‖z − y‖_2^2.

  - Hinge loss (for SVMs):

        L(z, y) = Σ_{i=1}^m max(0, 1 − y_i z_i).

  - Logistic loss (for logistic regression):

        L(z, y) = Σ_{i=1}^m log(1 + e^{−y_i z_i}).
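A minimal numpy sketch of the three losses (function names are illustrative; np.logaddexp gives a numerically stable log(1 + e^a)):

```python
import numpy as np

def squared_loss(z, y):
    return np.sum((z - y) ** 2)

def hinge_loss(z, y):
    return np.sum(np.maximum(0.0, 1.0 - y * z))

def logistic_loss(z, y):
    # logaddexp(0, a) = log(e^0 + e^a) = log(1 + e^a)
    return np.sum(np.logaddexp(0.0, -y * z))
```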
Key result

For the generic problem

    min_w L(X^T w) + λ‖w‖_2^2

the optimal w lies in the span of the data points x_1, ..., x_m:

    w = X v

for some vector v ∈ R^m.
Proof

Any w ∈ R^n can be written as the sum of two orthogonal vectors:

    w = X v + r

where X^T r = 0 (that is, r lies in the nullspace N(X^T), the orthogonal complement of the range R(X)).

Then X^T w = X^T X v, so the loss term does not depend on r, while by orthogonality ‖w‖_2^2 = ‖Xv‖_2^2 + ‖r‖_2^2. Setting r = 0 can only decrease the objective, so an optimal w has the form w = Xv.

[Figure: illustration of the fundamental theorem of linear algebra, R^m = R(A) ⊕ N(A^T), in R^3; the figure shows the case X = A = (a_1, a_2).]
Consequence of key result

For the generic problem

    min_w L(X^T w) + λ‖w‖_2^2

the optimal w can be written as w = Xv for some vector v ∈ R^m.

Hence the training problem depends only on K := X^T X:

    min_v L(Kv) + λ v^T K v.
Kernel matrix

The training problem depends only on the "kernel matrix" K = X^T X:

    K_ij = x_i^T x_j.

K contains the scalar products between all pairs of data points.

The prediction/classification rule also depends only on scalar products, between the new point x and the data points x_1, ..., x_m:

    w^T x = v^T X^T x = v^T k,   k := X^T x = (x^T x_1, ..., x^T x_m).
Computational advantages

Once K is formed (this takes O(m^2 n) operations), the training problem has only m variables.

When n ≫ m, this leads to a dramatic reduction in problem size.
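For the squared loss, the kernelized problem min_v ‖Kv − y‖_2^2 + λ v^T K v has the stationary point v = (K + λI)^{−1} y; a minimal sketch (function names are illustrative):

```python
import numpy as np

def kernel_ridge_fit(K, y, lam):
    """Solve the kernelized least-squares problem: (K + lam I) v = y
    makes the gradient 2 K ((K + lam I) v - y) vanish."""
    m = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(m), y)

def kernel_predict(v, k):
    """Prediction rule w^T x = v^T k, k = (x^T x_1, ..., x^T x_m)."""
    return v @ k
```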
How about the nonlinear case?

In the nonlinear case, we simply replace the feature vectors x_i by "augmented" feature vectors φ(x_i), with φ a non-linear mapping.

Example: in classification with a quadratic decision boundary, we use

    φ(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2).

This leads to the modified kernel matrix

    K_ij = φ(x_i)^T φ(x_j),   1 ≤ i, j ≤ m.
The kernel function

The kernel function associated with the mapping φ is

    k(x, z) = φ(x)^T φ(z).

It provides information about the metric in the feature space, e.g.:

    ‖φ(x) − φ(z)‖_2^2 = k(x, x) − 2 k(x, z) + k(z, z).

The computational effort involved in
  - solving the training problem, and
  - making a prediction
depends only on our ability to quickly evaluate such scalar products.

We can't choose k arbitrarily; it has to satisfy the above for some φ.
Quadratic kernels

Classification with quadratic boundaries involves feature vectors

    φ(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2).

Fact: given two vectors x, z ∈ R^2, we have

    φ(x)^T φ(z) = (1 + x^T z)^2,

provided the components x_1, x_2, and x_1 x_2 of φ are scaled by √2 (the scaling does not change the class of decision boundaries that can be represented).
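A quick numerical check of this identity (phi2 is an illustrative name; note the √2 scalings):

```python
import numpy as np

def phi2(x):
    """Degree-2 feature map on R^2, scaled so that
    phi2(x)^T phi2(z) == (1 + x^T z)^2 exactly."""
    s = np.sqrt(2.0)
    return np.array([1.0, s * x[0], s * x[1],
                     x[0]**2, s * x[0] * x[1], x[1]**2])

x, z = np.array([0.3, -1.2]), np.array([2.0, 0.5])
assert np.isclose(phi2(x) @ phi2(z), (1.0 + x @ z) ** 2)
```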
Polynomial kernels

More generally, when φ(x) is the vector formed with all the products between the components of x ∈ R^n, up to degree d, then for any two vectors x, z ∈ R^n,

    φ(x)^T φ(z) = (1 + x^T z)^d.

Evaluated this way, the computational effort grows only linearly in n.

This represents a dramatic speedup over the "brute force" approach:
  - form φ(x), φ(z);
  - evaluate φ(x)^T φ(z),
whose computational effort grows as n^d.
Other kernels

Gaussian kernel function:

    k(x, z) = exp(−‖x − z‖_2^2 / (2σ^2)),

where σ > 0 is a scale parameter. It allows us to ignore points that are too far apart, and corresponds to a non-linear mapping φ to an infinite-dimensional feature space.

There is a large variety (a zoo?) of other kernels, some adapted to the structure of the data (text, images, etc.).
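A minimal sketch of forming the Gaussian kernel matrix (the function name and column-wise data layout follow the slides' convention, not a library API):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    X: (n, m) array, data points as columns."""
    sq = np.sum(X ** 2, axis=0)                       # squared norms ||x_i||^2
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)  # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))
```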
In practice

  - Kernels need to be chosen by the user.
  - The choice is not always obvious; Gaussian and polynomial kernels are popular.
  - Control over-fitting via cross-validation (with respect to, say, the scale parameter of the Gaussian kernel, or the degree of the polynomial kernel).
  - Kernel methods are not well adapted to ℓ1-norm regularization.
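As one possible workflow (a sketch assuming scikit-learn is available; in its parameterization gamma = 1/(2σ^2), and rows, not columns, are data points):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Cross-validate the Gaussian-kernel scale (gamma) and the amount of
# regularization (C) on a 5-fold split; the grid values are illustrative.
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"gamma": [0.01, 0.1, 1.0], "C": [0.1, 1.0, 10.0]},
    cv=5,
)
# search.fit(X_train, y_train)  # X_train: (m, n); y_train: labels in {-1, +1}
```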