Welcome to the Kernel Class

• My name: Max (Welling)
• Book: there will be class notes/slides.
• Homework: reading material, some exercises, some MATLAB implementations.
• I like an active attitude in class: ask questions! Respond to my questions.
• Enjoy.

Primary Goal

• What is the primary goal of:
  - Machine Learning
  - Data Mining
  - Pattern Recognition
  - Data Analysis
  - Statistics
• Answer: automatic detection of non-coincidental structure in data.

Desiderata

• Robust algorithms are insensitive to outliers and wrong model assumptions.
• Stable algorithms generalize well to unseen data.
• Computationally efficient algorithms are necessary to handle large datasets.

Supervised & Unsupervised Learning

• Supervised: classification, regression.
• Unsupervised: clustering, dimensionality reduction, ranking, outlier detection.
• Primal vs. dual problems: generalized linear models vs. kernel classification.
  - Primal: f(x) = wᵀφ(x) + b
  - Dual: f(x) = Σᵢ αᵢ K(x, xᵢ) — this is like nearest-neighbor classification.

Feature Spaces

• Non-linear mapping to a feature space F: φ: x → φ(x), ℝᵈ → F.
• F can be:
  1. a high-dimensional space,
  2. an infinite-dimensional countable space (ℓ₂),
  3. a function space (Hilbert space).
• Example: φ(x, y) = (x², y², √2·xy).

Kernel Trick

• Note: in the dual representation we used the Gram matrix to express the solution.
• Kernel trick: replace the inner product of inputs by an inner product of features, i.e. a kernel evaluation:
  Gᵢⱼ = ⟨xᵢ, xⱼ⟩  →  Gᵢⱼ = ⟨φ(xᵢ), φ(xⱼ)⟩ = K(xᵢ, xⱼ).
• If we use algorithms that only depend on the Gram matrix G, then we never have to know (or compute) the actual features.
• This is the crucial point of kernel methods.

Properties of a Kernel

• Definition: a finitely positive semi-definite function k: X × X → ℝ is a symmetric function of its arguments for which every matrix formed by restriction to a finite subset of points is positive semi-definite: αᵀKα ≥ 0 for all α.
• Theorem: a function k: X × X → ℝ can be written as k(x, y) = ⟨φ(x), φ(y)⟩, where φ is a feature map x → φ(x) ∈ F, iff k(x, y) satisfies the positive semi-definiteness property.
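The kernel trick above can be checked numerically. The sketch below (not from the slides; data and names are illustrative) verifies that the explicit feature map φ(x, y) = (x², y², √2·xy) yields the same Gram matrix as the degree-2 polynomial kernel k(u, v) = ⟨u, v⟩², and that this Gram matrix is positive semi-definite, as the theorem requires.

```python
import numpy as np

def phi(x):
    """Explicit feature map from the slides: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k_poly2(u, v):
    """Degree-2 polynomial kernel k(u, v) = <u, v>^2."""
    return np.dot(u, v) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))  # illustrative data, 5 points in R^2

# Gram matrix computed two ways: via explicit features, and via the kernel.
G_feat = np.array([[np.dot(phi(u), phi(v)) for v in X] for u in X])
G_kern = np.array([[k_poly2(u, v) for v in X] for u in X])

print(np.allclose(G_feat, G_kern))  # True: <phi(u), phi(v)> = <u, v>^2

# Finitely positive semi-definite: all eigenvalues of G are >= 0
# (up to numerical tolerance), so alpha^T K alpha >= 0 for every alpha.
print(np.all(np.linalg.eigvalsh(G_kern) >= -1e-9))  # True
```

An algorithm that only touches G_kern never needs phi; for kernels with infinite-dimensional feature spaces (such as the RBF kernel) this is the only practical option.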
• Relevance: we can now check whether k(x, y) is a proper kernel using only properties of k(x, y) itself, i.e. without the need to know the feature map!

Modularity

• Kernel methods consist of two modules:
  1) the choice of kernel (this is non-trivial),
  2) the algorithm which takes kernels as input.
• Modularity: any kernel can be used with any kernel algorithm.
• Some kernels:
  - k(x, y) = exp(−‖x − y‖² / c)
  - k(x, y) = ⟨x, y⟩ᵈ
  - k(x, y) = tanh(⟨x, y⟩)
  - k(x, y) = 1 / (‖x − y‖² + c²)
• Some kernel algorithms:
  - support vector machine
  - Fisher discriminant analysis
  - kernel regression
  - kernel PCA
  - kernel CCA

Goodies and Baddies

Goodies:
• Kernel algorithms are typically constrained convex optimization problems, solved with either spectral methods or convex optimization tools.
• Efficient algorithms exist in most cases.
• The similarity to linear methods facilitates analysis; there are strong generalization bounds on test error.

Baddies:
• You need to choose the appropriate kernel.
• Kernel learning is prone to over-fitting.
• All information must go through the kernel bottleneck.

Regularization

• Regularization is very important!
• Regularization parameters are determined by out-of-sample measures (cross-validation, leave-one-out).
• Demo: Trevor Hastie.

Learning Kernels

• All information is tunneled through the Gram-matrix information bottleneck.
• The real art is to pick an appropriate kernel, e.g. take the RBF kernel k(x, y) = exp(−‖x − y‖² / c):
  - if c is very small: G = I (all data are dissimilar): over-fitting;
  - if c is very large: G = 1 (all data are very similar): under-fitting.
• We need to learn the kernel. Here are some ways to combine kernels k₁, k₂ to improve them:
  - conic combination: k(x, y) = α·k₁(x, y) + β·k₂(x, y), with α, β ≥ 0
  - product: k(x, y) = k₁(x, y) · k₂(x, y)
  - composition with a feature map: k(x, y) = k₁(φ(x), φ(y))
  - any polynomial of kernels with positive coefficients
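The bandwidth effect and the closure rules above can be illustrated numerically. The sketch below (illustrative data and c values, not from the slides) shows the RBF Gram matrix collapsing to the identity for tiny c and to the all-ones matrix for huge c, and checks that a conic combination and a product of kernels still give positive semi-definite Gram matrices.

```python
import numpy as np

def rbf_gram(X, c):
    """Gram matrix of the RBF kernel k(x, y) = exp(-||x - y||^2 / c)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / c)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))  # illustrative data, 6 points in R^2

# Very small c: all distinct points look dissimilar -> G ~ I (over-fitting).
G_small = rbf_gram(X, c=1e-6)
# Very large c: all points look identical -> G ~ all-ones (under-fitting).
G_large = rbf_gram(X, c=1e6)

print(np.allclose(G_small, np.eye(len(X)), atol=1e-6))              # True
print(np.allclose(G_large, np.ones((len(X), len(X))), atol=1e-3))   # True

# Closure rules: a conic combination (alpha, beta >= 0) and an elementwise
# (Schur) product of kernel Gram matrices remain positive semi-definite.
G_sum  = 0.5 * G_small + 2.0 * G_large
G_prod = G_small * G_large
for G in (G_sum, G_prod):
    print(np.all(np.linalg.eigvalsh(G) >= -1e-9))  # True, True
```

In practice c (and the mixing weights α, β) are the regularization parameters tuned by the out-of-sample measures mentioned above, such as cross-validation.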