Kernel Methods: Gaussian Processes
Marco Trincavelli, 5/12/2011
Mobile Robotics and Olfaction Lab, AASS Research Centre, Örebro University
State of the Art Methods of Data Modeling and Machine Learning, IMRIS program, Fall 2011

Acknowledgments
These slides have been adapted from the slides created by Achim Lilienthal for an introductory seminar on Gaussian Processes.

Repetition
1. Repetition
2. Weight-space view: Bayesian linear regression
3. Weight-space view: Bayesian nonlinear regression
4. Function-space view

SVM: Training and Predicting
Training an SVM means maximizing the dual objective. Predictions are then made as a weighted sum of dot products between the query point and the set of support vectors. Note that only dot products of the inputs appear, both in the training and in the prediction phase!

Kernel Trick
If we can find a kernel function k such that k(x, x') equals the dot product of the mapped inputs, then we do not even have to know the mapping to solve the problem. This has two advantages:
1. We save a lot of computation by not having to compute the mapping and then train in the high-dimensional space.
2. The data can be projected into a deliberately high-dimensional space, even an infinite-dimensional one (we have to be careful with this!).

Example: from 2D to 3D
A problem that is nonlinear in the 2D input space becomes a linear problem after mapping into a 3D feature space. [Figure: nonlinear problem in 2D, linear problem in 3D.]

Video: Mapping and Kernel
Thanks to Udi Aharoni. YouTube link for the video: http://www.youtube.com/watch?v=3liCbRZPrZA

SVM with Gaussian Kernel
The most common SVM. Its hyperparameters are typically chosen by a lattice (grid) search using cross-validation.

What is a Gaussian Process?
A Gaussian Process (GP) is a stochastic process, i.e. a generalization of a probability distribution to functions: a probability distribution describes a finite-dimensional random variable, whereas a stochastic process describes a distribution over functions. Inference can be viewed as taking place in the space of the model's parameters (weight-space view) or in the space of functions (function-space view). GPs are a particularly effective method for placing a prior distribution over the space of functions or over the space of the model's parameters. A GP can be seen as a Bayesian counterpart of the SVM, or as an infinitely large neural network.

What is a Gaussian Process? Formal definition
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. The random variables represent the value of the function f(x) at location x. A GP is completely specified by its mean function and by its covariance function. The term Gaussian process particularly refers to the case of an infinite index set.

Weight-space view: Bayesian linear regression
1. Repetition
2. Weight-space view: Bayesian linear regression
3. Weight-space view: Bayesian nonlinear regression
4. Function-space view

Linear Regression
We assume a linear process and use a linear model family; the goal is to infer the weights from noisy observations and to make predictions at new inputs.
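The equations on this slide did not survive extraction. As a point of reference, this is the standard Bayesian linear regression setup in the notation of Rasmussen & Williams (Ch. 2); the symbols w, Sigma_p and sigma_n are assumed names, not necessarily those used on the original slides:

f(\mathbf{x}) = \mathbf{x}^\top \mathbf{w}, \qquad y = f(\mathbf{x}) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_n^2)

p(\mathbf{y} \mid X, \mathbf{w}) = \mathcal{N}(X^\top \mathbf{w},\ \sigma_n^2 I), \qquad p(\mathbf{w}) = \mathcal{N}(\mathbf{0},\ \Sigma_p)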
Bayesian Linear Regression
Compute the posterior distribution of the weight vector given a likelihood function for the observations (the distribution of the noise) and a prior distribution over the weights. With normally distributed error and a uniform prior over the weights, Bayesian regression is equivalent to least squares (SSE). With normally distributed error and a normal prior over the weights (with zero mean), Bayesian regression is equivalent to ridge regression.

Computing the posterior over weights
Write the likelihood of the observations (normally distributed error) and the normal prior over the weights, multiply them, and collect the quadratic, linear and constant terms in the exponent.

Completing the squares
We are given a quadratic form defining the exponent of a Gaussian distribution, and we need to determine the corresponding mean and covariance. Write out the exponent of a general Gaussian distribution and equate the coefficients of the quadratic term and of the linear term in the weights.

Computing the posterior over weights
Matching the quadratic and linear terms shows that the posterior over the weights is again Gaussian.

Example: posterior over the weights
Consider a linear model with one input and a Gaussian prior over the two weights (w0, w1). We get observations and calculate their likelihood. Computing the posterior of the weights using Bayes' rule, the broad prior is multiplied by the likelihood and yields a posterior concentrated around the weights compatible with the data. [Figures: prior, likelihood and posterior contours in the (w0, w1) plane.]

Make predictions for a query point
Averaging the output of all possible linear models with respect to the Gaussian posterior over the model parameters (the weights), we obtain the predictive distribution. The predictive distribution is again Gaussian, with mean and covariance determined by the posterior over the weights.

Example: predictions for query points
The posterior distribution over the weights induces a predictive distribution over functions. [Figure: posterior over weights and the corresponding predictive distribution over functions.]

Weight-space view: Bayesian nonlinear regression
1. Repetition
2. Weight-space view: Bayesian linear regression
3. Weight-space view: Bayesian nonlinear regression
4. Function-space view

Towards nonlinear models
The Bayesian linear model suffers from limited expressiveness. Project the inputs into some high-dimensional space using a set of basis functions and apply the linear model in that space. In the input space the resulting model will in general be nonlinear. As we have already observed when discussing SVMs, if our model uses only dot products of the inputs we can avoid calculating the projection and work with kernels instead.
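Before moving to feature space, a compact recap of the weight-space results from the preceding slides, again reconstructed in the notation of Rasmussen & Williams since the original formulas were lost (A, Sigma_p and sigma_n are assumed symbols):

p(\mathbf{w} \mid X, \mathbf{y}) = \mathcal{N}\big(\bar{\mathbf{w}},\ A^{-1}\big), \qquad \bar{\mathbf{w}} = \sigma_n^{-2} A^{-1} X \mathbf{y}, \qquad A = \sigma_n^{-2} X X^\top + \Sigma_p^{-1}

p(f_* \mid \mathbf{x}_*, X, \mathbf{y}) = \mathcal{N}\big(\mathbf{x}_*^\top \bar{\mathbf{w}},\ \mathbf{x}_*^\top A^{-1} \mathbf{x}_*\big)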
Making predictions in feature space
Start from the predictive distribution in input space and introduce a function phi which maps a D-dimensional input vector into an N-dimensional feature space (with N typically much larger than D). The predictive distribution in feature space has the same form, with the inputs replaced by their feature-space images.

Predictive mean in feature space
Starting from the expression of the predictive mean, multiply by suitable matrices from the right and from the left and substitute back into the expression of the mean, so that the mapped inputs appear only through their products.

Predictive covariance in feature space
Apply the matrix inversion lemma to A and substitute, so that the predictive covariance too is expressed through products of the mapped inputs.

Dot products
Notice that the term phi(x)' Sigma_p phi(x') appearing in these expressions is a dot product. Indeed, since Sigma_p is positive definite, it is possible to find a Cholesky-type square-root decomposition of Sigma_p; defining psi(x) as the mapped input multiplied by this factor, the kernel (or covariance function) k(x, x') = phi(x)' Sigma_p phi(x') = psi(x)' psi(x') is indeed a dot product. In the equations of the predictive mean and covariance the feature space always appears in this form, so we can remove it from the formulas using the kernel.

Predictive distribution
Using the kernel, the formulas for the predictive mean and variance can be rewritten entirely in terms of kernel evaluations between inputs.

Function-space view
1. Repetition
2. Weight-space view: Bayesian linear regression
3. Weight-space view: Bayesian nonlinear regression
4. Function-space view

The function-space view
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. The random variables represent the value of the function f(x) at location x. A GP is completely specified by its mean function and by its covariance function.

Bayesian linear regression revisited
The Bayesian linear regression model with a zero-mean Gaussian prior over the weights is a simple example of a Gaussian process. The mean of the process is zero and its covariance is exactly the kernel phi(x)' Sigma_p phi(x') defined above: this is how we defined a kernel, or covariance function.

Mean and Covariance
For notational simplicity we will take the mean of the process to be zero, but this is in general not needed. The covariance function specifies the covariance between pairs of random variables. Note that the covariance between outputs is written as a function of the inputs. The specification of the covariance function implies a distribution over functions.

Covariance Functions
Common choices are the linear, the squared exponential (SE) and the exponential covariance functions.
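The formulas of these covariance functions were lost in extraction; their standard forms are given below, where the length scale l and the signal variance sigma_f^2 are assumed hyperparameter names:

k_{\text{lin}}(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}', \qquad
k_{\text{SE}}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\!\Big(-\tfrac{\|\mathbf{x} - \mathbf{x}'\|^2}{2 l^2}\Big), \qquad
k_{\text{exp}}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\!\Big(-\tfrac{\|\mathbf{x} - \mathbf{x}'\|}{l}\Big)

Note that the MATLAB listings on the following slides use the parameterization exp(-(x - x').^2 ./ l^2), i.e. without the factor 2 in the denominator, and multiply by sig_f rather than sig_f^2; this makes no difference in the examples because sig_f = 1.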
Drawing samples from a GP
We will now draw samples from a Gaussian process with a squared exponential covariance function. Here is the MATLAB code to do that, annotated line by line:

% Set the parameters of the covariance function
l = 0.2; sig_f = 1;
% Set the points where the function will be evaluated
x = (-10:0.05:10)';
% Mean of the GP (set to zero)
m_f = zeros(size(x));
% Generate all the possible pairs of points
[covInd1 covInd2] = meshgrid(x,x);
% Calculate the covariance function for all the possible pairs of points
cov_f = sig_f * exp(-(covInd1-covInd2).^2 ./ l.^2);
% Calculate the Cholesky decomposition of the covariance matrix
% (add 1e-10 to the diagonal to ensure positive definiteness)
cholCov_f = chol(cov_f + 1e-10 * eye(size(cov_f)),'lower');
% Generate independent pseudorandom numbers drawn from the standard normal distribution
uids = randn(size(m_f));
% Compute f, which has the desired distribution (mean m_f, covariance cov_f)
f = m_f + cholCov_f * uids;
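A one-line check (not on the original slides) of why the Cholesky factor produces samples with the desired covariance: if K = L L^T and u is standard normal, then

\mathbf{u} \sim \mathcal{N}(\mathbf{0}, I), \quad \mathbf{f} = \mathbf{m} + L\mathbf{u} \;\Rightarrow\; \mathbb{E}[\mathbf{f}] = \mathbf{m}, \qquad \operatorname{cov}(\mathbf{f}) = L\,\mathbb{E}[\mathbf{u}\mathbf{u}^\top]\,L^\top = L L^\top = K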
GP with SE covariance function
The squared exponential covariance function leads to functions that are smooth over a characteristic length scale. [Figures: sample functions f(x) drawn from a GP with SE covariance for several settings of the covariance function parameters, illustrating how the length scale controls how rapidly the functions vary and the signal variance controls their amplitude.]

GP with exponential covariance function
The exponential kernel corresponds to the Ornstein-Uhlenbeck process, introduced in 1930 to describe Brownian motion. [Figures: sample functions f(x) drawn from a GP with exponential covariance for several parameter settings; compared to the SE covariance, the sampled functions are much less smooth.]

Prediction with noise-free observations
Prior: the joint distribution of the training outputs and the test outputs is Gaussian, with the covariance between any two points given by the covariance function. The joint covariance matrix is built from the respective blocks: K(X,X) (NxN), K(X,X*) (NxN*), K(X*,X) (N*xN) and K(X*,X*) (N*xN*).
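Written out explicitly (a reconstruction in standard notation, with f the training outputs, f_* the test outputs and X, X_* the corresponding inputs), this joint prior reads:

\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\!\left( \mathbf{0},\ \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right)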
Posterior over functions
Prior: the joint distribution of training and test outputs. Posterior: restrict the joint prior distribution to contain only those functions which agree with the observed data points. One could reject the sampled functions that disagree with the observations, but this is computationally inefficient; instead, we condition the joint Gaussian prior on the observations. If two sets of variables are jointly Gaussian, then the distribution of one set conditioned on the other is again Gaussian.

Conditional Gaussian distributions
Suppose x is a D-dimensional vector with a Gaussian distribution. Partition x into two disjoint subsets, x_a (the first M components) and x_b (the remaining D - M components). Define the corresponding partitions of the mean vector and of the covariance matrix, and also of the precision matrix (the inverse of the covariance matrix).

Conditional Gaussian distributions
We want to calculate the conditional distribution p(x_a | x_b). Consider the quadratic form in the exponent of the Gaussian distribution and use the partitioning defined above: as a function of x_a this is again a quadratic form, hence p(x_a | x_b) is Gaussian. Complete the squares with respect to x_a to find its mean and covariance.

Completing the squares
We are given a quadratic form defining the exponent of a Gaussian distribution, and we need to determine the corresponding mean and covariance. Write out the exponent of a general Gaussian distribution and equate the coefficients of the quadratic term and of the linear term in x_a.

Conditional Gaussian distributions
Matching the quadratic and linear terms gives the conditional mean and covariance expressed in terms of the blocks of the precision matrix. We still need to express them in terms of the blocks of the covariance matrix.
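The conditional-Gaussian results referred to here, reconstructed in the notation of Bishop (Ch. 2.3.1), with Lambda denoting the precision matrix. The first line is the precision-matrix form obtained by completing the square; the second line is the equivalent covariance-matrix form derived on the next two slides:

p(\mathbf{x}_a \mid \mathbf{x}_b) = \mathcal{N}\big(\boldsymbol{\mu}_{a|b},\ \Lambda_{aa}^{-1}\big), \qquad \boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a - \Lambda_{aa}^{-1}\Lambda_{ab}(\mathbf{x}_b - \boldsymbol{\mu}_b)

\boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\mathbf{x}_b - \boldsymbol{\mu}_b), \qquad \Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}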
In terms of the covariance matrix...
The precision matrix is the inverse of the covariance matrix; writing out this relation blockwise and solving the resulting linear system expresses the blocks of the precision matrix in terms of the blocks of the covariance matrix.

Conditional Gaussian distributions
Substituting the two expressions just obtained into the conditional mean and covariance gives the conditional distribution in terms of the covariance blocks (the second pair of formulas in the block above).

Prediction with noise-free observations
Condition the joint Gaussian prior on the observations in order to get the posterior distribution over functions.

Example: GP with SE covariance
MATLAB code for noise-free GP regression with an SE covariance function, annotated line by line:

l = 1; sig_f = 1;
x_test = (-10:0.05:10)';
m_f = zeros(size(x_test));
% 4 observations (training points)
x_train = [-3 -2 2 7];
fx_train = [1 0 2 -2];
% Calculate the partitions of the joint covariance matrix
[covXXInd1 covXXInd2] = meshgrid(x_train,x_train);
[covXXsInd1 covXXsInd2] = meshgrid(x_train,x_test);
[covXsXsInd1 covXsXsInd2] = meshgrid(x_test,x_test);
covXsXs = sig_f * exp(-(covXsXsInd1-covXsXsInd2).^2 ./ l.^2);
covXX = sig_f * exp(-(covXXInd1-covXXInd2).^2 ./ l.^2);
covXXs = sig_f * exp(-(covXXsInd1-covXXsInd2).^2 ./ l.^2);
% Cholesky decomposition of K(X,X): the "training" of the GP, complexity O(N^3)
% (add 1e-9 to the diagonal to ensure positive definiteness)
chol_covXX = chol(covXX + 1e-9 * eye(size(covXX)));
% Calculate the predictive distribution, complexity O(N^2)
%posterior_mean = covXXs/covXX * fx_train';
posterior_mean = (covXXs/chol_covXX)/chol_covXX' * fx_train';
%posterior_cov = covXsXs - covXXs/covXX * covXXs';
posterior_cov = covXsXs - (covXXs/chol_covXX)/chol_covXX' * covXXs';
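The noise-free predictive equations that the code above implements, again a reconstruction in the standard notation of Rasmussen & Williams since the slide's own formulas were lost:

\bar{\mathbf{f}}_* = K(X_*, X)\, K(X, X)^{-1} \mathbf{f}

\operatorname{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\, K(X, X)^{-1} K(X, X_*)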
Example: GP with SE covariance
Predictive distribution of a GP with SE covariance function for the four noise-free observations above (drawn as black dots): the predictive mean interpolates the observations, and the predictive variance shrinks to zero at the training points and grows away from them. Three sample functions drawn from the predictive distribution all pass through the observations. [Figures: predictive distribution and sampled functions f(x) for two settings of the covariance parameters.]

Prediction with noisy observations
If we assume additive i.i.d. Gaussian noise on the observations, the joint distribution of the observed target values and of the function values at the test locations is again Gaussian, with the noise variance added to the diagonal of the training-inputs block of the covariance.

Prediction with noisy observations
Deriving the conditional distribution (with a procedure analogous to the noise-free case) we obtain the key predictive equations for Gaussian process regression, shown below. These are the same equations obtained using the weight-space view!
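A hedged reconstruction of these key predictive equations (Rasmussen & Williams, Ch. 2), with y the noisy targets and sigma_n^2 the noise variance:

\bar{\mathbf{f}}_* = K(X_*, X)\,\big[K(X, X) + \sigma_n^2 I\big]^{-1} \mathbf{y}

\operatorname{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\,\big[K(X, X) + \sigma_n^2 I\big]^{-1} K(X, X_*)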
Example: GP with SE covariance
The same example with noisy observations, annotated:

l = 1.0; sig_f = 1;
% Standard deviation of the noise on the observations
sig_n = 0.1;
x_test = (-10:0.05:10)';
m_f = zeros(size(x_test));
x_train = [-3 -2 2 7];
y_train = [1 0 2 -2];
[covXXInd1 covXXInd2] = meshgrid(x_train,x_train);
[covXXsInd1 covXXsInd2] = meshgrid(x_train,x_test);
[covXsXsInd1 covXsXsInd2] = meshgrid(x_test,x_test);
covXsXs = sig_f * exp(-(covXsXsInd1-covXsXsInd2).^2 ./ l.^2);
covXXs = sig_f * exp(-(covXXsInd1-covXXsInd2).^2 ./ l.^2);
covXX = sig_f * exp(-(covXXInd1-covXXInd2).^2 ./ l.^2);
% Add the noise variance to the diagonal of K(X,X)
covXX_noisy = covXX + sig_n^2 * eye(size(covXX));
chol_covXX_noisy = chol(covXX_noisy);
%posterior_mean = covXXs/covXX_noisy * y_train';
posterior_mean = (covXXs/chol_covXX_noisy)/chol_covXX_noisy' * y_train';
%posterior_cov = covXsXs - covXXs/covXX_noisy * covXXs';
posterior_cov = covXsXs - (covXXs/chol_covXX_noisy)/chol_covXX_noisy' * covXXs';

Example: GP with SE covariance
Predictive distribution of a GP with SE covariance function for the noisy observations: with observation noise the predictive mean no longer interpolates the data exactly, and the predictive variance does not shrink to zero at the training points. Three samples are drawn from the predictive distribution for each setting. [Figures: predictive distributions and sampled functions f(x) for several settings of the covariance parameters and of the noise standard deviation.]
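The samples shown in the plots can be drawn from the computed predictive distribution with a few extra lines of MATLAB. This is a minimal sketch continuing the listing above; the jitter value, the names n_samples, L_post and samples, and the plotting call are my additions, not from the slides:

% Draw three samples from N(posterior_mean, posterior_cov)
n_samples = 3;
jitter = 1e-9 * eye(size(posterior_cov));        % small diagonal term for numerical stability
L_post = chol(posterior_cov + jitter, 'lower');  % posterior_cov ~ L_post * L_post'
samples = repmat(posterior_mean, 1, n_samples) + L_post * randn(length(posterior_mean), n_samples);
plot(x_test, samples); xlabel('x'); ylabel('f(x)');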
Model Selection
For Gaussian processes, the model selection problem boils down to the selection of: the covariance function, the values of the parameters of the covariance function, and the value of the standard deviation of the noise on the observations. Once the covariance function has been selected, its parameters can be chosen by maximizing the marginal likelihood (or evidence). The parameters of the covariance function and the standard deviation of the noise are called the hyperparameters of the model.

Marginal likelihood
The marginal likelihood is the integral of the likelihood times the prior. Under the Gaussian process model the prior is Gaussian and the likelihood is a factorized Gaussian, so the logarithm of the marginal likelihood has a closed form,

\log p(\mathbf{y} \mid X) = -\tfrac{1}{2}\,\mathbf{y}^\top \big(K + \sigma_n^2 I\big)^{-1}\mathbf{y} - \tfrac{1}{2}\log\big|K + \sigma_n^2 I\big| - \tfrac{n}{2}\log 2\pi,

with a data-fit term, a model-complexity term and a normalization constant.

Model Selection
The hyperparameters can be set by maximizing the marginal likelihood. If gradient-based methods are used, the partial derivatives of the marginal likelihood with respect to the hyperparameters have to be calculated. There can be multiple local maxima. Cross-validation methods can also be used. (A small numerical sketch follows after the reading list.)

Classification with Gaussian Processes
In the regression case (weight-space view) we have a Gaussian prior over the weights, a Gaussian likelihood of the observations, a Gaussian posterior over the weights and a Gaussian predictive distribution, and we can calculate all of these distributions analytically. In classification the targets are discrete class labels, so a Gaussian likelihood is inappropriate: a sigmoid-shaped likelihood is much better. The predictive distribution then no longer has a simple analytical form and has to be calculated using analytical approximations of the integrals or Monte Carlo sampling.

Book Readings
Pattern Recognition and Machine Learning (Bishop): Ch. 2.3.1, 6.4.1, 6.4.2
Gaussian Processes for Machine Learning (Rasmussen, Williams): Ch. 2
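Complementing the Model Selection slides above, a minimal MATLAB sketch (not from the original slides) of hyperparameter selection by evaluating the log marginal likelihood on a grid over the length scale l and the noise standard deviation sig_n, and picking the maximum; this is a brute-force alternative to the gradient-based optimization mentioned on the slide. The grid ranges and all variable names are my own choices; the training data are those of the earlier examples:

% Grid search over (l, sig_n) by maximizing the log marginal likelihood
x_train = [-3 -2 2 7]'; y_train = [1 0 2 -2]'; sig_f = 1;
n = numel(x_train);
l_grid = 0.2:0.2:3; sign_grid = 0.01:0.02:0.5;
lml = zeros(numel(l_grid), numel(sign_grid));
for i = 1:numel(l_grid)
  for j = 1:numel(sign_grid)
    [I1, I2] = meshgrid(x_train, x_train);
    % SE covariance (same parameterization as the listings above) plus noise on the diagonal
    K = sig_f * exp(-(I1-I2).^2 ./ l_grid(i)^2) + sign_grid(j)^2 * eye(n);
    L = chol(K, 'lower');                 % K = L*L'
    alpha = L' \ (L \ y_train);           % alpha = K^{-1} y
    % log p(y|X) = -0.5*y'*K^{-1}*y - 0.5*log|K| - (n/2)*log(2*pi)
    lml(i,j) = -0.5 * (y_train' * alpha) - sum(log(diag(L))) - 0.5*n*log(2*pi);
  end
end
[~, idx] = max(lml(:));
[ibest, jbest] = ind2sub(size(lml), idx);
fprintf('best l = %.2f, best sig_n = %.2f\n', l_grid(ibest), sign_grid(jbest));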