Kernel Methods
Gaussian Processes
Marco Trincavelli
5/12/2011
Mobile Robotics and Olfaction Lab
AASS Research Centre, Örebro University
State of the Art Methods of Data Modeling and Machine Learning,
IMRIS program, Fall 2011
Acknowledgments
These slides have been adapted from the slides created by Achim
Lilienthal for an introductory seminar on Gaussian Processes.
Repetition
1. Repetition
2. Weight-space view: Bayesian linear regression
3. Weight-space view: Bayesian nonlinear regression
4. Function-space view
SVM: Training and Predicting
 Training an SVM – maximizing the dual:
 Making predictions with an SVM:
where the sum runs over the set of support vectors.
Note that only dot products of the inputs appear, both in the training and in the prediction phase!
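The dual and the decision function referred to here take the standard textbook form (notation assumed, not recovered from the slide):
$$\max_{\boldsymbol\alpha}\; \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, y_i y_j \langle \mathbf{x}_i,\mathbf{x}_j\rangle \quad \text{s.t. } 0\le\alpha_i\le C,\ \ \sum_i \alpha_i y_i = 0$$
$$f(\mathbf{x}) = \operatorname{sign}\Big(\sum_{i\in SV} \alpha_i y_i \langle \mathbf{x}_i,\mathbf{x}\rangle + b\Big), \qquad SV = \text{set of support vectors}$$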
Dot product → Kernel Trick
If we can find a kernel function that directly computes the dot product of the mapped inputs:
then we don't even have to know the mapping to solve the problem...
This has two advantages:
1. We save a lot of computation by not having to compute the mapping and then train in the high-dimensional space...
2. The data can be projected into a deliberately high-dimensional space, even an infinite-dimensional one... (we have to be careful with this!)
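In symbols (standard notation, assumed here), the condition is that the kernel evaluates the dot product of the mapped inputs directly:
$$k(\mathbf{x}_i,\mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\rangle$$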
Example: from 2D to 3D
[Figure: a nonlinear problem in the original 2D input space becomes a linear problem after mapping to 3D]
Video: Mapping and Kernel
Thanks to Udi Aharoni. YouTube link for the video: http://www.youtube.com/watch?v=3liCbRZPrZA
SVM with Gaussian Kernel
 The most common SVM in practice.
 The kernel and regularization parameters are selected by a lattice (grid) search using cross-validation.
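The Gaussian (RBF) kernel is usually written as (standard form, assumed):
$$k(\mathbf{x}_i,\mathbf{x}_j) = \exp\!\big(-\gamma\,\|\mathbf{x}_i-\mathbf{x}_j\|^2\big)$$
and the lattice search is typically over the kernel width and the regularization parameter C.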
What is a Gaussian Process?
 A Gaussian Process (GP) is a stochastic process, i.e. a generalization of a probability distribution to functions:
 a probability distribution describes a finite-dimensional random variable;
 a stochastic process describes a distribution over functions.
 Inference can be illustrated as taking place in the space of the model's parameters (weight-space view) or in the space of the functions (function-space view).
 GPs are a particularly effective method for placing a prior distribution over the space of functions or over the space of the model's parameters.
 A GP can be seen as a Bayesian version of the SVM, or as an infinitely large neural network.
What is a Gaussian Process?
 Formal definition:
A Gaussian process is a collection of random
variables, any finite number of which have a joint
Gaussian distribution.
 The random variables represent the value of the function at
location x.
 A GP is completely specified by its mean function and by its covariance function.
 Gaussian processes in particular refer to an infinite index set: there is one random variable for every possible input location.
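In the usual notation (assumed here, following Rasmussen & Williams):
$$f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}),\, k(\mathbf{x},\mathbf{x}')\big), \qquad m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})], \qquad k(\mathbf{x},\mathbf{x}') = \mathbb{E}\big[(f(\mathbf{x})-m(\mathbf{x}))(f(\mathbf{x}')-m(\mathbf{x}'))\big]$$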
Weight-space view:
Bayesian linear regression
1. Repetition
2. Weight-space view: Bayesian linear regression
3. Weight-space view: Bayesian nonlinear regression
4. Function-space view
Linear Regression
 We assume a linear process:
 We use a linear model family.
 ...and the goal is to make predictions for new inputs.
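The model referred to here is presumably the standard Bayesian linear model with additive Gaussian noise (notation assumed):
$$f(\mathbf{x}) = \mathbf{x}^\top \mathbf{w}, \qquad y = f(\mathbf{x}) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0,\,\sigma_n^2)$$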
Bayesian Linear Regression
 Compute the posterior distribution of the weight vector given a likelihood function for the observations (determined by the distribution of the noise) and a prior distribution over the weights:
 With normally distributed error and a uniform prior over the weights, Bayesian regression is equivalent to least-squares (SSE) regression.
 With normally distributed error and a normal prior over the weights (with zero mean), Bayesian regression is equivalent to ridge regression.
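The posterior follows from Bayes' rule (standard form):
$$p(\mathbf{w}\mid X,\mathbf{y}) = \frac{p(\mathbf{y}\mid X,\mathbf{w})\,p(\mathbf{w})}{p(\mathbf{y}\mid X)}$$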
Computing the posterior over weights
 Likelihood of the observations (normally distributed error):
 Prior over the weights (normal distribution):
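In Rasmussen & Williams' notation (assumed here):
$$p(\mathbf{y}\mid X,\mathbf{w}) = \mathcal{N}\big(X^\top\mathbf{w},\ \sigma_n^2 I\big), \qquad \mathbf{w} \sim \mathcal{N}(\mathbf{0},\ \Sigma_p)$$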
Computing the posterior over weights
Multiplying the likelihood by the prior and collecting the terms in the exponent, we can identify a quadratic term, a linear term, and a constant term in the weights.
Completing the squares
We are given a quadratic form defining the exponent terms in a
Gaussian distribution, and we need to determine the corresponding
mean and covariance. Exponent of general Gaussian distribution:
Equate the coefficients:
 Quadratic term:
 Linear term:
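Concretely (standard manipulation, notation assumed): the exponent of a general Gaussian can be written as
$$-\tfrac{1}{2}(\mathbf{z}-\boldsymbol\mu)^\top \Sigma^{-1} (\mathbf{z}-\boldsymbol\mu) = -\tfrac{1}{2}\mathbf{z}^\top \Sigma^{-1}\mathbf{z} + \mathbf{z}^\top \Sigma^{-1}\boldsymbol\mu + \text{const},$$
so matching it against a quadratic form $-\tfrac{1}{2}\mathbf{z}^\top A\,\mathbf{z} + \mathbf{z}^\top \mathbf{b}$ gives $\Sigma = A^{-1}$ and $\boldsymbol\mu = A^{-1}\mathbf{b}$.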
Computing the posterior over weights
 Quadratic term:
 Linear term:
Therefore the posterior of the weights is Gaussian:
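The resulting posterior is the standard one (cf. Rasmussen & Williams, Eq. 2.8; notation assumed):
$$\mathbf{w}\mid X,\mathbf{y} \sim \mathcal{N}\Big(\tfrac{1}{\sigma_n^2} A^{-1} X \mathbf{y},\ A^{-1}\Big), \qquad A = \tfrac{1}{\sigma_n^2} X X^\top + \Sigma_p^{-1}$$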
Example: posterior over the weights
 Linear model, one input:
 Gaussian prior over the weights:
[Figure: Gaussian prior over the weights, shown in the (w0, w1) plane]
Example: posterior over the weights
 We get observations.
 We calculate the likelihood of the observations:
[Figure: likelihood of the observations, shown in the (w0, w1) plane]
Example: posterior over the weights
Compute the posterior of the weights using Bayes' rule:
[Figure: prior, likelihood, and resulting posterior over the weights in the (w0, w1) plane]
Make predictions for a query point
 Averaging the output of all possible linear models w.r.t. the
Gaussian posterior over the model parameters (the weights) we
obtain the predictive distribution:
 The predictive distribution is again Gaussian, with the following
mean and covariance:
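In the same notation, this is the standard weight-space predictive distribution (assumed form):
$$p(f_*\mid \mathbf{x}_*, X, \mathbf{y}) = \int p(f_*\mid \mathbf{x}_*, \mathbf{w})\, p(\mathbf{w}\mid X,\mathbf{y})\, d\mathbf{w} = \mathcal{N}\Big(\tfrac{1}{\sigma_n^2}\,\mathbf{x}_*^\top A^{-1} X\mathbf{y},\ \ \mathbf{x}_*^\top A^{-1}\mathbf{x}_*\Big)$$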
Example: predictions for query points
The posterior distribution over weights induces a predictive distribution over functions:
[Figure: posterior over the weights in the (w0, w1) plane (left) and the induced predictive distribution over functions (right)]
Weight-space view:
Bayesian nonlinear regression
1. Repetition
2. Weight-space view: Bayesian linear regression
3. Weight-space view: Bayesian nonlinear regression
4. Function-space view
Towards nonlinear models
 The Bayesian linear model suffers from limited expressiveness.
 Project the inputs into some high-dimensional space using a set of basis functions and apply the linear model in that high-dimensional space.
 In the input space the model will, in general, be nonlinear.
 As we have already observed when discussing SVMs, if our model uses only dot products of the inputs we can avoid calculating the projection and work with kernels instead.
Making predictions in feature space
 Predictive distribution in input space:
 Introduce a function which maps the input vector into a (typically higher-dimensional) feature space.
 Predictive distribution in feature space:
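With a feature map φ and Φ the matrix of mapped training inputs (notation assumed), the predictive distribution keeps the same form:
$$f_*\mid \mathbf{x}_*, X, \mathbf{y} \sim \mathcal{N}\Big(\tfrac{1}{\sigma_n^2}\,\phi(\mathbf{x}_*)^\top A^{-1}\Phi\,\mathbf{y},\ \ \phi(\mathbf{x}_*)^\top A^{-1}\phi(\mathbf{x}_*)\Big), \qquad A = \tfrac{1}{\sigma_n^2}\Phi\Phi^\top + \Sigma_p^{-1}$$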
Predictive mean in feature space
Consider the following expression:
Multiply it from the right and from the left by the appropriate factors:
Substitute the result into the expression for the mean:
Predictive covariance in feature space
Apply the matrix inversion lemma to A:
Then set:
Dot products
 Notice that this expression is a dot product. Indeed, since the prior covariance is positive definite, it is possible to find its Cholesky decomposition. Let us define a kernel (or covariance function):
 In the equations of the predictive mean and covariance, the feature space always appears in this form.
We can remove it from the formulas using the kernel!
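In symbols (following Rasmussen & Williams; notation assumed):
$$k(\mathbf{x},\mathbf{x}') = \phi(\mathbf{x})^\top \Sigma_p\, \phi(\mathbf{x}') = \psi(\mathbf{x})\cdot\psi(\mathbf{x}'), \qquad \psi(\mathbf{x}) = \Sigma_p^{1/2}\,\phi(\mathbf{x})$$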
Predictive distribution
Using the kernel expression we can re-write the
formulas for the predictive mean and variance in the following
way:
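With K the kernel matrix of the training inputs and k_* the vector of kernel values between the test input and the training inputs (notation assumed), the standard form is:
$$\bar f_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1}\mathbf{y}, \qquad \mathbb{V}[f_*] = k(\mathbf{x}_*,\mathbf{x}_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1}\mathbf{k}_*$$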
Function-space view
1. Repetition
2. Weight-space view: Bayesian linear regression
3. Weight-space view: Bayesian nonlinear regression
4. Function-space view
The function-space view
A Gaussian process is a collection of random
variables, any finite number of which have a joint
Gaussian distribution.
 The random variables represent the value of the function at
location x.
 A GP is completely specified by its mean function and by its covariance function.
Bayesian linear regression revisited
 The Bayesian linear regression model with a Gaussian prior over the weights is a simple example of a Gaussian Process.
 The mean and covariance of the process are:
Recall: this is exactly how we defined a kernel (or covariance function) earlier.
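Explicitly (standard result; notation assumed, with φ the feature map of the linear model):
$$\mathbb{E}[f(\mathbf{x})] = \phi(\mathbf{x})^\top \mathbb{E}[\mathbf{w}] = 0, \qquad \mathbb{E}[f(\mathbf{x})f(\mathbf{x}')] = \phi(\mathbf{x})^\top \Sigma_p\, \phi(\mathbf{x}')$$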
Mean and Covariance
 For notational simplicity we will take the mean to be zero, but this is in general not needed.
 The covariance function specifies the covariance between pairs of random variables:
 Note that the covariance between outputs is written as a function of the inputs.
 The specification of the covariance function implies a distribution over functions.
Covariance Functions
 Linear
 Squared Exponential
 Exponential
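Common parameterizations are (assumed here; the constants on the original slides may differ slightly):
$$k_{\text{lin}}(\mathbf{x},\mathbf{x}') = \sigma_f^2\, \mathbf{x}^\top\mathbf{x}', \qquad k_{\text{SE}}(\mathbf{x},\mathbf{x}') = \sigma_f^2 \exp\!\Big(-\frac{\|\mathbf{x}-\mathbf{x}'\|^2}{\ell^2}\Big), \qquad k_{\text{exp}}(\mathbf{x},\mathbf{x}') = \sigma_f^2 \exp\!\Big(-\frac{\|\mathbf{x}-\mathbf{x}'\|}{\ell}\Big)$$
The SE form above matches the convention used in the Matlab code on the following slides.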
Drawing samples from a GP
We will now draw samples from a Gaussian Process with a Squared
Exponential covariance function. Here is the Matlab code to do
that:
l = 0.2;                                % set the parameters of the covariance function
sig_f = 1;
x = (-10:0.05:10)';                     % set the points where the function will be evaluated
m_f = zeros(size(x));                   % mean of the GP (set to zero)
[covInd1 covInd2] = meshgrid(x,x);      % generate all the possible pairs of points
% calculate the covariance function for all the possible pairs of points
cov_f = sig_f * exp(-(covInd1-covInd2).^2 ./ l.^2);
% calculate the Cholesky decomposition of the covariance matrix
% (add a small jitter, 1e-10, to the diagonal to ensure positive definiteness)
cholCov_f = chol(cov_f + 1e-10 * eye(size(cov_f)),'lower');
% generate independent pseudorandom numbers drawn from the standard normal distribution
uids = randn(size(m_f));
% compute f, which has the desired distribution with mean m_f and covariance cov_f
f = m_f + cholCov_f * uids;
GP with SE covariance function
The squared exponential covariance function leads to functions that
are smooth over a characteristic length scale. In this case the
covariance function parameters are .
[Figure: sample functions f(x) drawn from the GP, plotted for x from -10 to 10]
GP with SE covariance function
The squared exponential covariance function leads to functions that
are smooth over a characteristic length scale. In this case the
covariance function parameters are .
[Figure: sample functions f(x) drawn from the GP, plotted for x from -10 to 10]
GP with SE covariance function
The squared exponential covariance function leads to functions that
are smooth over a characteristic length scale. In this case the
covariance function parameters are .
[Figure: sample functions f(x) drawn from the GP, plotted for x from -10 to 10]
GP with SE covariance function
The squared exponential covariance function leads to functions that
are smooth over a characteristic length scale. In this case the
covariance function parameters are .
[Figure: sample functions f(x) drawn from the GP, plotted for x from -10 to 10]
GP with exponential covariance function
The exponential kernel corresponds to the Ornstein-Uhlenbeck process
introduced in 1930 to describe Brownian motion. In this case the
covariance function parameters are .
[Figure: sample functions f(x) drawn from the GP with exponential covariance, plotted for x from -10 to 10]
GP with exponential covariance function
The exponential kernel corresponds to the Ornstein-Uhlenbeck process
introduced in 1930 to describe Brownian motion. In this case the
covariance function parameters are .
[Figure: sample functions f(x) drawn from the GP with exponential covariance, plotted for x from -10 to 10]
GP with exponential covariance function
The exponential kernel corresponds to the Ornstein-Uhlenbeck process
introduced in 1930 to describe Brownian motion. In this case the
covariance function parameters are .
[Figure: sample functions f(x) drawn from the GP with exponential covariance, plotted for x from -10 to 10]
Prediction with noise free observations
 Prior: the joint distribution of the training outputs and the test outputs is Gaussian.
 The covariance between two points is given by the covariance function. The joint covariance is built from the respective blocks of the covariance matrices: an NxN block between training points, an NxN* block between training and test points, an N*xN block between test and training points, and an N*xN* block between test points.
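In block form (standard notation, assumed):
$$\begin{bmatrix}\mathbf{f}\\ \mathbf{f}_*\end{bmatrix} \sim \mathcal{N}\left(\mathbf{0},\ \begin{bmatrix} K(X,X) & K(X,X_*)\\ K(X_*,X) & K(X_*,X_*)\end{bmatrix}\right)$$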
Posterior over functions
 Prior: joint distribution of training and test outputs.
 Posterior: restrict the joint prior distribution to contain only those functions which agree with the observed data points.
 One option is to reject functions that disagree with the observations (computationally inefficient).
 Better: condition the joint Gaussian prior on the observations.
If two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian.
Conditional Gaussian distributions
 Suppose we have a vector with a multivariate Gaussian distribution.
 Partition it into two disjoint subsets: the first components and the remaining ones.
 Define the corresponding partitions of the mean vector and of the covariance matrix. Define the partitions also for the precision matrix (the inverse of the covariance matrix):
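Using Bishop's notation (assumed here), with the vector x split into x_a and x_b, the partitions are:
$$\boldsymbol\mu = \begin{bmatrix}\boldsymbol\mu_a\\ \boldsymbol\mu_b\end{bmatrix}, \qquad \Sigma = \begin{bmatrix}\Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb}\end{bmatrix}, \qquad \Lambda \equiv \Sigma^{-1} = \begin{bmatrix}\Lambda_{aa} & \Lambda_{ab}\\ \Lambda_{ba} & \Lambda_{bb}\end{bmatrix}$$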
Conditional Gaussian distributions
 We want to calculate the conditional distribution of one partition given the other.
 Consider the quadratic form in the exponent of the Gaussian distribution and use the partitioning previously defined:
 We see that, as a function of the first partition, this is again a quadratic form. Hence the conditional distribution will be Gaussian.
 Complete the squares with respect to the first partition to find its mean and covariance.
Completing the squares
We are given a quadratic form defining the exponent terms in a Gaussian distribution, and we need to determine the corresponding mean and covariance. Exponent of the general Gaussian distribution:
Equate the coefficients:
 Quadratic term:
 Linear term:
Conditional Gaussian distributions
 Quadratic term:
 Linear term:
 Now the conditional mean and covariance are expressed in terms of the precision matrix. We need to express them in terms of the covariance matrix.
In terms of the covariance matrix...
 The precision matrix is the inverse of the covariance matrix, therefore we have:
 Solving the linear system we obtain:
Conditional Gaussian distributions
 Substituting the two expressions just obtained into the expressions for the conditional mean and covariance, we obtain:
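The result, in the same assumed notation (cf. Bishop, Eqs. 2.81–2.82):
$$p(\mathbf{x}_a\mid\mathbf{x}_b) = \mathcal{N}\big(\boldsymbol\mu_{a|b},\ \Sigma_{a|b}\big), \qquad \boldsymbol\mu_{a|b} = \boldsymbol\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\mathbf{x}_b - \boldsymbol\mu_b), \qquad \Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$$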
Prediction with noise free observations
 Condition the joint Gaussian prior on the observations in order
to get the posterior distribution over functions:
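Applying the conditioning formula to the joint prior gives the standard noise-free posterior (cf. Rasmussen & Williams, Eq. 2.19; notation assumed):
$$\mathbf{f}_*\mid X_*, X, \mathbf{f} \sim \mathcal{N}\big(K(X_*,X)K(X,X)^{-1}\mathbf{f},\ \ K(X_*,X_*) - K(X_*,X)K(X,X)^{-1}K(X,X_*)\big)$$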
Example: GP with SE covariance
l = 1;
sig_f = 1;
x_test = (-10:0.05:10)';
m_f = zeros(size(x_test));
x_train = [-3 -2 2 7];       % 4 observations (training points)
fx_train = [1 0 2 -2];
% calculate the partitions of the joint covariance matrix
[covXXInd1 covXXInd2] = meshgrid(x_train,x_train);
[covXXsInd1 covXXsInd2] = meshgrid(x_train,x_test);
[covXsXsInd1 covXsXsInd2] = meshgrid(x_test,x_test);
covXsXs = sig_f * exp(-(covXsXsInd1-covXsXsInd2).^2 ./ l.^2);
covXX = sig_f * exp(-(covXXInd1-covXXInd2).^2 ./ l.^2);
covXXs = sig_f * exp(-(covXXsInd1-covXXsInd2).^2 ./ l.^2);
% Cholesky decomposition of K(X,X) – training of the GP, complexity O(N^3)
% (add a small jitter to the diagonal to keep the matrix positive definite)
chol_covXX = chol(covXX + 1e-9 * eye(size(covXX)));
% calculate the predictive distribution, complexity O(N^2)
%posterior_mean = covXXs/covXX * fx_train';
posterior_mean = (covXXs/chol_covXX)/chol_covXX' * fx_train';
%posterior_cov = covXsXs - covXXs/covXX * covXXs';
posterior_cov = covXsXs - (covXXs/chol_covXX)/chol_covXX' * covXXs';
Example: GP with SE covariance
Predictive distribution of a GP with SE covariance function with
parameters . The observations are drawn with
black dots.
[Figure: predictive distribution of the GP, f(x) vs x from -10 to 10, with the observations shown as black dots]
Example: GP with SE covariance
Draw three samples from the predictive distribution.
[Figure: three samples from the predictive distribution, f(x) vs x from -10 to 10]
Example: GP with SE covariance
Predictive distribution of a GP with SE covariance function with
parameters . The observations are drawn with
black dots.
[Figure: predictive distribution of the GP, f(x) vs x from -10 to 10, with the observations shown as black dots]
Example: GP with SE covariance
Draw three samples from the predictive distribution.
[Figure: three samples from the predictive distribution, f(x) vs x from -10 to 10]
Prediction with noisy observations
 If we assume additive i.i.d. Gaussian noise on the observations:
 The joint distribution of the observed target values and the function values at the test locations is:
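With the noise added to the training block, the joint prior reads (standard form, notation assumed):
$$\begin{bmatrix}\mathbf{y}\\ \mathbf{f}_*\end{bmatrix} \sim \mathcal{N}\left(\mathbf{0},\ \begin{bmatrix} K(X,X)+\sigma_n^2 I & K(X,X_*)\\ K(X_*,X) & K(X_*,X_*)\end{bmatrix}\right)$$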
Prediction with noisy observations
 Deriving the conditional distribution (by a procedure analogous to the noise-free case), we obtain the key predictive equations for Gaussian process regression:
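These are, in the standard form (cf. Rasmussen & Williams, Eqs. 2.22–2.24; notation assumed):
$$\bar{\mathbf{f}}_* = K(X_*,X)\,\big[K(X,X)+\sigma_n^2 I\big]^{-1}\mathbf{y}, \qquad \operatorname{cov}(\mathbf{f}_*) = K(X_*,X_*) - K(X_*,X)\,\big[K(X,X)+\sigma_n^2 I\big]^{-1}K(X,X_*)$$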
 The same equations are obtained using the weight-space view!
Example: GP with SE covariance
l = 1.0;
sig_f = 1;
sig_n = 0.1;                 % standard deviation of the noise on the observations
x_test = (-10:0.05:10)';
m_f = zeros(size(x_test));
x_train = [-3 -2 2 7];
y_train = [1 0 2 -2];
% calculate the partitions of the joint covariance matrix
[covXXInd1 covXXInd2] = meshgrid(x_train,x_train);
[covXXsInd1 covXXsInd2] = meshgrid(x_train,x_test);
[covXsXsInd1 covXsXsInd2] = meshgrid(x_test,x_test);
covXsXs = sig_f * exp(-(covXsXsInd1-covXsXsInd2).^2 ./ l.^2);
covXXs = sig_f * exp(-(covXXsInd1-covXXsInd2).^2 ./ l.^2);
covXX = sig_f * exp(-(covXXInd1-covXXInd2).^2 ./ l.^2);
covXX_noisy = covXX + sig_n^2 * eye(size(covXX));    % add the noise to the diagonal of K(X,X)
chol_covXX_noisy = chol(covXX_noisy);                % Cholesky decomposition – training, O(N^3)
% calculate the predictive distribution, complexity O(N^2)
%posterior_mean = covXXs/covXX_noisy * y_train';
posterior_mean = (covXXs/chol_covXX_noisy)/chol_covXX_noisy' * y_train';
%posterior_cov = covXsXs - covXXs/covXX_noisy * covXXs';
posterior_cov = covXsXs - (covXXs/chol_covXX_noisy)/chol_covXX_noisy' * covXXs';
Example: GP with SE covariance
Predictive distribution of a GP with SE covariance function with
parameters . The standard deviation of the noise
on the observations is .
[Figure: predictive distribution of the GP with noisy observations, f(x) vs x from -10 to 10]
Example: GP with SE covariance
Draw three samples from the predictive distribution.
[Figure: three samples from the predictive distribution, f(x) vs x from -10 to 10]
Example: GP with SE covariance
Predictive distribution of a GP with SE covariance function with
parameters . The standard deviation of the noise
on the observations is .
[Figure: predictive distribution of the GP with noisy observations, f(x) vs x from -10 to 10]
Example: GP with SE covariance
Draw three samples from the predictive distribution.
[Figure: three samples from the predictive distribution, f(x) vs x from -10 to 10]
Example: GP with SE covariance
Predictive distribution of a GP with SE covariance function with
parameters . The standard deviation of the noise
on the observations is .
[Figure: predictive distribution of the GP with noisy observations, f(x) vs x from -10 to 10]
Example: GP with SE covariance
Draw three samples from the predictive distribution.
[Figure: three samples from the predictive distribution, f(x) vs x from -10 to 10]
Example: GP with SE covariance
Predictive distribution of a GP with SE covariance function with
parameters . The standard deviation of the noise
on the observations is .
[Figure: predictive distribution of the GP with noisy observations, f(x) vs x from -10 to 10]
Example: GP with SE covariance
Draw three samples from the predictive distribution.
[Figure: three samples from the predictive distribution, f(x) vs x from -10 to 10]
Model Selection
For Gaussian Processes, the model selection problem boils down to the selection of:
 The covariance function.
 The values of the parameters of the covariance function.
 The value of the standard deviation of the noise in the observations.
Once the covariance function has been selected, the parameters can be chosen by maximizing the marginal likelihood (or evidence).
The parameters of the covariance function and the standard deviation of the noise are called the hyperparameters of the model.
Marginal likelihood
 The marginal likelihood is the integral of the likelihood times the prior:
 Under the Gaussian process model the prior is Gaussian and the likelihood is a factorized Gaussian. Therefore the logarithm of the marginal likelihood takes the following form:
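Explicitly (cf. Rasmussen & Williams, Eq. 2.30; notation assumed):
$$\log p(\mathbf{y}\mid X) = -\tfrac{1}{2}\,\mathbf{y}^\top \big(K+\sigma_n^2 I\big)^{-1}\mathbf{y} \;-\; \tfrac{1}{2}\log\big|K+\sigma_n^2 I\big| \;-\; \tfrac{n}{2}\log 2\pi$$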
Model Selection
 The hyperparameters can be set by maximizing the marginal likelihood.
 If gradient-based methods are used, the partial derivatives of the marginal likelihood w.r.t. the hyperparameters have to be calculated.
 There can be multiple local maxima.
 Cross-validation methods can also be used.
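As a minimal sketch (not from the original slides) of how the hyperparameters of the noisy SE example could be selected by maximizing the log marginal likelihood with a generic optimizer. It reuses x_train and y_train from the earlier example and assumes gp_neg_log_ml is saved as its own function file (or a MATLAB version that allows local functions in scripts):
theta0 = log([1; 1; 0.1]);                          % initial [l; sig_f; sig_n], on a log scale
obj = @(theta) gp_neg_log_ml(theta, x_train(:), y_train(:));
theta_hat = fminsearch(obj, theta0);                % may converge to a local optimum of the evidence
hyp = exp(theta_hat)                                % display the selected hyperparameters

function nlml = gp_neg_log_ml(theta, x, y)
    l = exp(theta(1)); sig_f = exp(theta(2)); sig_n = exp(theta(3));
    n = numel(x);
    [I, J] = meshgrid(x, x);
    K = sig_f * exp(-(I - J).^2 ./ l.^2);           % same SE form as on the slides
    Ky = K + sig_n^2 * eye(n);                      % add the noise variance to the diagonal
    L = chol(Ky, 'lower');                          % Ky = L*L'
    alpha = L' \ (L \ y);                           % Ky \ y via the Cholesky factor
    % negative log marginal likelihood: 0.5*y'*Ky^-1*y + 0.5*log|Ky| + n/2*log(2*pi)
    nlml = 0.5 * y' * alpha + sum(log(diag(L))) + 0.5 * n * log(2*pi);
end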
Classification with Gaussian Processes
In the regression case we have (weight-space view):
[Diagram: Gaussian prior over the weights + Gaussian likelihood of the observations → Gaussian posterior over the weights → Gaussian predictive distribution]
We can calculate all these distributions analytically!
Classification with Gaussian Processes
[Diagram: Gaussian prior over the weights + likelihood of the observations → posterior over the weights → predictive distribution]
In classification the targets are discrete class labels, so the Gaussian likelihood is inappropriate. A sigmoid-shaped likelihood is much better! The predictive distribution then no longer has a simple analytical form and has to be calculated using analytical approximations of the integrals or Monte Carlo sampling.
Book Readings
 Pattern Recognition and Machine Learning (Bishop)
 Ch. 2.3.1, 6.4.1, 6.4.2
 Gaussian Processes for Machine Learning (Rasmussen, Williams)
 Ch. 2