Welcome to the Kernel Class

• My name: Max (Welling)
• Book: there will be class notes/slides.
• Homework: reading material, some exercises, some MATLAB implementations.
• I like an active attitude in class: ask questions! Respond to my questions.
• Enjoy.

Primary Goal

• What is the primary goal of:
  - Machine Learning
  - Data Mining
  - Pattern Recognition
  - Data Analysis
  - Statistics
• Answer: automatic detection of non-coincidental structure in data.

Desiderata

• Robust algorithms are insensitive to outliers and wrong model assumptions.
• Stable algorithms generalize well to unseen data.
• Computationally efficient algorithms are necessary to handle large datasets.

Supervised & Unsupervised Learning

• Supervised: classification, regression.
• Unsupervised: clustering, dimensionality reduction, ranking, outlier detection.
• Primal vs. dual problems: generalized linear models vs. kernel classification.
  - Primal: f(x) = wᵀφ(x) + b
  - Dual: f(x) = Σᵢ αᵢ K(x, xᵢ) — this is like nearest-neighbor classification.

Feature Spaces

• Non-linear mapping to a feature space F: φ: x → φ(x), ℝᵈ → F.
• F can be:
  1. a high-dimensional space,
  2. an infinite-dimensional countable space (ℓ₂),
  3. a function space (Hilbert space).
• Example: φ(x, y) = (x², y², √2·xy).

Kernel Trick

• Note: in the dual representation we used the Gram matrix to express the solution.
• Kernel trick: replace the inner product of inputs by an inner product of features, i.e. a kernel evaluation:
  Gᵢⱼ = ⟨xᵢ, xⱼ⟩  →  Gᵢⱼ = ⟨φ(xᵢ), φ(xⱼ)⟩ = K(xᵢ, xⱼ).
• If we use algorithms that only depend on the Gram matrix G, then we never have to know (or compute) the actual features.
• This is the crucial point of kernel methods.

Properties of a Kernel

• Definition: a finitely positive semi-definite function k: X × X → ℝ is a symmetric function of its arguments for which every matrix formed by restriction to a finite subset of points is positive semi-definite: αᵀKα ≥ 0 for all α.
• Theorem: a function k: X × X → ℝ can be written as k(x, y) = ⟨φ(x), φ(y)⟩, where φ is a feature map x → φ(x) ∈ F, iff k(x, y) satisfies the positive semi-definiteness property.
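The kernel trick above can be checked numerically. The sketch below (not from the slides; data and names are illustrative) verifies that the explicit feature map φ(x, y) = (x², y², √2·xy) yields the same Gram matrix as the degree-2 polynomial kernel k(u, v) = ⟨u, v⟩², and that this Gram matrix is positive semi-definite, as the theorem requires.

```python
import numpy as np

def phi(x):
    """Explicit feature map from the slides: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k_poly2(u, v):
    """Degree-2 polynomial kernel k(u, v) = <u, v>^2."""
    return np.dot(u, v) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))  # illustrative data, 5 points in R^2

# Gram matrix computed two ways: via explicit features, and via the kernel.
G_feat = np.array([[np.dot(phi(u), phi(v)) for v in X] for u in X])
G_kern = np.array([[k_poly2(u, v) for v in X] for u in X])

print(np.allclose(G_feat, G_kern))  # True: <phi(u), phi(v)> = <u, v>^2

# Finitely positive semi-definite: all eigenvalues of G are >= 0
# (up to numerical tolerance), so alpha^T K alpha >= 0 for every alpha.
print(np.all(np.linalg.eigvalsh(G_kern) >= -1e-9))  # True
```

An algorithm that only touches G_kern never needs phi; for kernels with infinite-dimensional feature spaces (such as the RBF kernel) this is the only practical option.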
• Relevance: we can now check whether k(x, y) is a proper kernel using only properties of k(x, y) itself, i.e. without the need to know the feature map!

Modularity

• Kernel methods consist of two modules:
  1) the choice of kernel (this is non-trivial),
  2) the algorithm which takes kernels as input.
• Modularity: any kernel can be used with any kernel algorithm.
• Some kernels:
  - k(x, y) = exp(−‖x − y‖² / c)
  - k(x, y) = ⟨x, y⟩ᵈ
  - k(x, y) = tanh(⟨x, y⟩)
  - k(x, y) = 1 / (‖x − y‖² + c²)
• Some kernel algorithms:
  - support vector machine
  - Fisher discriminant analysis
  - kernel regression
  - kernel PCA
  - kernel CCA

Goodies and Baddies

Goodies:
• Kernel algorithms are typically constrained convex optimization problems, solved with either spectral methods or convex optimization tools.
• Efficient algorithms exist in most cases.
• The similarity to linear methods facilitates analysis; there are strong generalization bounds on test error.

Baddies:
• You need to choose the appropriate kernel.
• Kernel learning is prone to over-fitting.
• All information must go through the kernel bottleneck.

Regularization

• Regularization is very important!
• Regularization parameters are determined by out-of-sample measures (cross-validation, leave-one-out).
• Demo: Trevor Hastie.

Learning Kernels

• All information is tunneled through the Gram-matrix information bottleneck.
• The real art is to pick an appropriate kernel, e.g. take the RBF kernel k(x, y) = exp(−‖x − y‖² / c):
  - if c is very small: G = I (all data are dissimilar): over-fitting;
  - if c is very large: G = 1 (all data are very similar): under-fitting.
• We need to learn the kernel. Here are some ways to combine kernels k₁, k₂ to improve them:
  - conic combination: k(x, y) = α·k₁(x, y) + β·k₂(x, y), with α, β ≥ 0
  - product: k(x, y) = k₁(x, y) · k₂(x, y)
  - composition with a feature map: k(x, y) = k₁(φ(x), φ(y))
  - any polynomial of kernels with positive coefficients
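The bandwidth effect and the closure rules above can be illustrated numerically. The sketch below (illustrative data and c values, not from the slides) shows the RBF Gram matrix collapsing to the identity for tiny c and to the all-ones matrix for huge c, and checks that a conic combination and a product of kernels still give positive semi-definite Gram matrices.

```python
import numpy as np

def rbf_gram(X, c):
    """Gram matrix of the RBF kernel k(x, y) = exp(-||x - y||^2 / c)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / c)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))  # illustrative data, 6 points in R^2

# Very small c: all distinct points look dissimilar -> G ~ I (over-fitting).
G_small = rbf_gram(X, c=1e-6)
# Very large c: all points look identical -> G ~ all-ones (under-fitting).
G_large = rbf_gram(X, c=1e6)

print(np.allclose(G_small, np.eye(len(X)), atol=1e-6))              # True
print(np.allclose(G_large, np.ones((len(X), len(X))), atol=1e-3))   # True

# Closure rules: a conic combination (alpha, beta >= 0) and an elementwise
# (Schur) product of kernel Gram matrices remain positive semi-definite.
G_sum  = 0.5 * G_small + 2.0 * G_large
G_prod = G_small * G_large
for G in (G_sum, G_prod):
    print(np.all(np.linalg.eigvalsh(G) >= -1e-9))  # True, True
```

In practice c (and the mixing weights α, β) are the regularization parameters tuned by the out-of-sample measures mentioned above, such as cross-validation.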