CS 59000 Statistical Machine Learning
Lecture 12
Yuan (Alan) Qi

Outline
• Review of Laplace approximation, BIC, Bayesian logistic regression
• Kernel methods
• Kernel ridge regression
• Kernel construction
• Kernel principal component analysis

Laplace Approximation for Posterior
Gaussian approximation around the mode z0:
q(z) ∝ exp( -(1/2) (z - z0)^T A (z - z0) ),  where A = -∇∇ ln p(z) evaluated at z0.

Evidence Approximation
Laplace approximation of the log evidence:
ln p(D) ≈ ln p(D | θ_MAP) + ln p(θ_MAP) + (M/2) ln 2π - (1/2) ln |A|,
where M is the number of parameters and A is the Hessian of the negative log of the unnormalized posterior at θ_MAP.

Bayesian Information Criterion
A cruder approximation of the Laplace approximation:
ln p(D) ≈ ln p(D | θ_MAP) - (1/2) M ln N.
When a more accurate evidence approximation is needed, use the full Laplace approximation instead.

Bayesian Logistic Regression
Review: the Laplace approximation gives a Gaussian approximation to the posterior over the weights.

Kernel Methods
Predictions are linear combinations of a kernel function evaluated at the training data points.
Kernel function <-> feature-space mapping: k(x, x') = φ(x)^T φ(x')
Linear kernel: k(x, x') = x^T x'
Stationary kernels: k(x, x') = k(x - x'), i.e. the kernel depends only on the difference of its arguments.

Fast Evaluation of Inner Products of Feature Mappings by Kernel Functions
Computing the inner product directly in feature space requires computing six feature values and 3 x 3 = 9 multiplications; evaluating the kernel function requires only 2 multiplications and a squaring. (A small numerical check of this equivalence appears at the end of these notes.)

Kernel Trick
1. Reformulate an algorithm so that the input vector x enters only in the form of inner products x^T x'.
2. Replace each input x by its feature mapping φ(x).
3. Replace the inner product by a kernel function: k(x, x') = φ(x)^T φ(x').
Examples: kernel PCA, kernel Fisher discriminant, support vector machines.

Dual Representation for Ridge Regression
Ridge regression cost: J(w) = (1/2) Σ_n (w^T φ(x_n) - t_n)^2 + (λ/2) w^T w.
Setting the gradient to zero gives w = Σ_n a_n φ(x_n) = Φ^T a, with dual variables a_n = -(1/λ)(w^T φ(x_n) - t_n).

Kernel Ridge Regression
Using the kernel trick, the cost function depends on the inputs only through the Gram matrix K = Φ Φ^T, with K_nm = k(x_n, x_m).
Equivalent cost function over the dual variables:
J(a) = (1/2) a^T K K a - a^T K t + (1/2) t^T t + (λ/2) a^T K a.
Minimizing over the dual variables gives a = (K + λ I_N)^{-1} t, and the prediction at a new input x is y(x) = k(x)^T (K + λ I_N)^{-1} t, where k(x) has elements k(x_n, x). (A small implementation sketch appears at the end of these notes.)

Constructing Kernel Functions
Example: Gaussian kernel k(x, x') = exp( -||x - x'||^2 / (2σ^2) ).
Why is it a valid kernel? Expanding ||x - x'||^2 = x^T x - 2 x^T x' + x'^T x' gives
k(x, x') = exp(-x^T x / 2σ^2) exp(x^T x' / σ^2) exp(-x'^T x' / 2σ^2);
the middle factor is a valid kernel (the exponential series is a limit of polynomial kernels with nonnegative coefficients), and multiplying a kernel by f(x) f(x') preserves validity.
Generalization: replace x^T x' by any valid kernel κ(x, x'):
k(x, x') = exp( -( κ(x, x) - 2 κ(x, x') + κ(x', x') ) / (2σ^2) ).

Combining Generative and Discriminative Models by Kernels
Since each modeling approach has distinct advantages, how can we combine them?
• Use generative models to construct kernels.
• Use these kernels in discriminative approaches.

Measuring Probability Similarity by Kernels
Simple inner product: k(x, x') = p(x) p(x').
For a mixture distribution: k(x, x') = Σ_i p(x | i) p(x' | i) p(i).
For infinite mixture models: k(x, x') = ∫ p(x | z) p(x' | z) p(z) dz.
For models with latent variables (e.g., hidden Markov models): k(X, X') = Σ_Z p(X | Z) p(X' | Z) p(Z), where Z is the sequence of hidden states.

Fisher Kernels
Fisher score: g(θ, x) = ∇_θ ln p(x | θ).
Fisher information matrix: F = E_x[ g(θ, x) g(θ, x)^T ].
Fisher kernel: k(x, x') = g(θ, x)^T F^{-1} g(θ, x').
Sample average: F ≈ (1/N) Σ_n g(θ, x_n) g(θ, x_n)^T.
(A small sketch using a Gaussian generative model appears at the end of these notes.)

Principal Component Analysis (PCA)
Assume the data {x_n} have zero mean, so the sample covariance is S = (1/N) Σ_n x_n x_n^T, and u_i is a normalized eigenvector:
S u_i = λ_i u_i, with u_i^T u_i = 1.

Feature Mapping
Map each input to feature space, x_n -> φ(x_n). Eigenproblem in feature space (assuming, for now, that the mapped data have zero mean):
C v_i = λ_i v_i, with C = (1/N) Σ_n φ(x_n) φ(x_n)^T.

Dual Variables
Suppose λ_i > 0 (why can it not be smaller than 0? because C is positive semi-definite). Then v_i lies in the span of the mapped data points:
v_i = Σ_n a_in φ(x_n).

Eigenproblem in Feature Space (1)
Substituting this expansion into C v_i = λ_i v_i and multiplying both sides by φ(x_l)^T, we obtain
K^2 a_i = λ_i N K a_i, and hence K a_i = λ_i N a_i,
where K is the Gram matrix with K_nm = k(x_n, x_m).

Eigenproblem in Feature Space (2)
Normalization condition: v_i^T v_i = 1 implies λ_i N a_i^T a_i = 1.
Projection coefficient: φ(x)^T v_i = Σ_n a_in k(x, x_n).

General Case: Non-zero Mean
Centered kernel matrix: K~ = K - 1_N K - K 1_N + 1_N K 1_N, where 1_N is the N x N matrix with every element equal to 1/N.

Kernel PCA on Synthetic Data
[Figure: contour plots of the projection coefficients in feature space.]

Limitations of Kernel PCA
Discussion...
• If N is large, kernel PCA is computationally expensive, since K is N x N while S is only D x D.
• Low-rank approximation is not straightforward.
(A small implementation sketch appears at the end of these notes.)
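The sketches below are not part of the original slides; they are minimal numerical illustrations in Python/NumPy. First, a check of the fast-evaluation slide for the polynomial kernel k(x, z) = (x^T z)^2 on 2-D inputs, whose explicit feature map is φ(x) = (x1^2, √2 x1 x2, x2^2); the function names and test vectors are chosen only for illustration.

```python
import numpy as np

def phi(x):
    # Explicit feature map for k(x, z) = (x^T z)^2 on 2-D inputs:
    # phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    # Direct kernel evaluation: the inner product of the raw inputs, squared.
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))  # inner product computed in feature space
print(poly_kernel(x, z))       # same value, computed via the kernel
```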
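Next, a minimal kernel ridge regression sketch using the dual solution a = (K + λI)^{-1} t from the slides. The Gaussian kernel, the toy sine-curve data, and the bandwidth/regularization values are assumptions made only for this example.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    # Gram matrix with entries exp(-||x_n - z_m||^2 / (2 sigma^2)).
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_ridge_fit(X, t, lam=0.1, sigma=1.0):
    # Dual solution: a = (K + lambda I)^{-1} t.
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), t)

def kernel_ridge_predict(X_train, a, X_new, sigma=1.0):
    # Prediction: y(x) = k(x)^T a, where k_n(x) = k(x, x_n).
    return gaussian_kernel(X_new, X_train, sigma) @ a

# Toy usage: fit a noisy sine curve and predict at a few points.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
a = kernel_ridge_fit(X, t, lam=0.1, sigma=0.5)
X_new = np.linspace(-3, 3, 5)[:, None]
print(kernel_ridge_predict(X, a, X_new, sigma=0.5))
```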
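A minimal Fisher kernel sketch, taking a one-dimensional Gaussian with unknown mean as the generative model; the model choice, fixed variance, and function names are assumptions for illustration, while the score, sample-average Fisher information, and kernel follow the definitions on the Fisher Kernels slide.

```python
import numpy as np

def fisher_kernel_gaussian(X, x, x_prime, sigma=1.0):
    # Assumed generative model for illustration: p(x | mu) = N(x; mu, sigma^2), mu fit to the data.
    mu = X.mean()

    # Fisher score: g(mu, x) = d/d mu ln p(x | mu) = (x - mu) / sigma^2.
    def score(u):
        return np.atleast_1d((u - mu) / sigma ** 2)

    # Fisher information by sample average: F ≈ (1/N) sum_n g(x_n) g(x_n)^T.
    G = np.stack([score(u) for u in X])
    F = (G.T @ G) / len(X)

    # Fisher kernel: k(x, x') = g(x)^T F^{-1} g(x').
    return float(score(x) @ np.linalg.solve(F, score(x_prime)))

X = np.random.default_rng(0).normal(1.0, 1.0, size=100)
print(fisher_kernel_gaussian(X, 0.5, 2.0))
```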
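Finally, a minimal kernel PCA sketch following the dual eigenproblem and the centering formula above. The Gaussian kernel, bandwidth, and synthetic clustered data are assumptions for illustration, not part of the slides.

```python
import numpy as np

def kernel_pca(X, n_components=2, sigma=1.0):
    # Gram matrix of the Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma ** 2))

    # Center in feature space: K~ = K - 1_N K - K 1_N + 1_N K 1_N.
    N = len(X)
    one_n = np.full((N, N), 1.0 / N)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Dual eigenproblem: K~ a_i = (lambda_i N) a_i; keep the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:n_components]
    a = eigvecs[:, idx]

    # Normalization: lambda_i N a_i^T a_i = 1, and eigvals[idx] = lambda_i N.
    a = a / np.sqrt(np.maximum(eigvals[idx], 1e-12))

    # Projection coefficients of the training points: phi(x_l)^T v_i = sum_n a_in k~(x_l, x_n).
    return Kc @ a

# Toy usage on synthetic 2-D clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(20, 2)) for m in (-2.0, 0.0, 2.0)])
Z = kernel_pca(X, n_components=2, sigma=1.0)
print(Z.shape)  # (60, 2)
```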