Machine Learning – Lecture 17
Introduction to Gaussian Processes
14.07.2009
Bastian Leibe
RWTH Aachen
http://www.umic.rwth-aachen.de/multimedia
[email protected]
Many slides adapted from B. Schiele
Course Outline
• Fundamentals (2 weeks)
  - Bayes Decision Theory
  - Probability Density Estimation
• Discriminative Approaches (5 weeks)
  - Linear Discriminants, SVMs, Boosting
  - Decision Trees, Random Forests, Model Selection
• Graphical Models (5 weeks)
  - Bayesian Networks & Applications
  - Markov Random Fields & Applications
  - Exact Inference
  - Approximate Inference
• Regression Problems (2 weeks)
  - Gaussian Processes
Recap: Sampling Idea
• Objective
  - Evaluate the expectation of a function f(z) w.r.t. a probability distribution p(z).
• Sampling idea
  - Draw L independent samples z^{(l)}, l = 1,…,L, from p(z).
  - This allows the expectation to be approximated by a finite sum:
      \hat{f} = \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)})
  - As long as the samples z^{(l)} are drawn independently from p(z), this is an unbiased estimate, independent of the dimensionality of z!
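As a quick illustration (not part of the original slides), here is a minimal NumPy sketch of this estimator; the choices f(z) = z^2 and p(z) = N(0, 1) are assumptions made purely for demonstration.

```python
# Approximate E[f(z)] under p(z) = N(0, 1) by a finite sum over L i.i.d. samples.
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    return z ** 2                        # E[z^2] = 1 for z ~ N(0, 1)

L = 10000
samples = rng.normal(0.0, 1.0, size=L)   # z^(l) drawn independently from p(z)
f_hat = np.mean(f(samples))              # \hat{f} = (1/L) * sum_l f(z^(l))
print(f_hat)                             # close to the true value 1.0
```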
Recap: Sampling from a pdf
• In general, assume we are given the pdf p(x) and the corresponding cumulative distribution:
      F(x) = \int_{-\infty}^{x} p(z)\,dz
• To draw samples from this pdf, we can invert the cumulative distribution function:
      u \sim \mathrm{Uniform}(0,1) \;\Rightarrow\; F^{-1}(u) \sim p(x)
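A minimal sketch of this inversion trick, assuming an exponential target distribution (chosen only because its inverse CDF has a simple closed form):

```python
# Sample from p(x) = lam * exp(-lam * x) via the inverse CDF F^{-1}(u) = -ln(1 - u) / lam.
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0

u = rng.uniform(0.0, 1.0, size=10000)   # u ~ Uniform(0, 1)
x = -np.log(1.0 - u) / lam              # F^{-1}(u) ~ p(x)
print(x.mean())                         # close to the true mean 1 / lam = 0.5
```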
Recap: Rejection Sampling
• Assumptions
  - Sampling directly from p(z) is difficult.
  - But we can easily evaluate p(z) up to some normalization factor Z_p:
      p(z) = \frac{1}{Z_p}\,\tilde{p}(z)
• Idea
  - We need some simpler distribution q(z) (called the proposal distribution) from which we can draw samples.
  - Choose a constant k such that  \forall z: \; k\,q(z) \ge \tilde{p}(z).
• Sampling procedure (a short code sketch follows below)
  - Generate a number z_0 from q(z).
  - Generate a number u_0 from the uniform distribution over [0, k\,q(z_0)].
  - If u_0 > \tilde{p}(z_0), reject the sample, otherwise accept it.
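A minimal sketch of the procedure above; the unnormalized target \tilde{p}(z), the Gaussian proposal q(z), and the bound k are all assumptions chosen for illustration:

```python
# Rejection sampling: draw z0 from q, u0 from Uniform(0, k*q(z0)), accept if u0 <= p~(z0).
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(z):                          # unnormalized target, easy to evaluate
    return np.exp(-0.5 * (z - 1.5) ** 2) + 0.5 * np.exp(-0.5 * (z + 1.0) ** 2)

def q_pdf(z, scale=3.0):                 # proposal density q(z) = N(0, scale^2)
    return np.exp(-0.5 * (z / scale) ** 2) / (scale * np.sqrt(2.0 * np.pi))

k = 20.0                                 # assumed constant with k * q(z) >= p~(z) for all z

samples = []
while len(samples) < 5000:
    z0 = rng.normal(0.0, 3.0)             # generate z0 from q(z)
    u0 = rng.uniform(0.0, k * q_pdf(z0))  # generate u0 from Uniform(0, k*q(z0))
    if u0 <= p_tilde(z0):                 # accept, otherwise reject
        samples.append(z0)
```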
Recap: Importance Sampling
• Approach
  - Approximate expectations directly (but this does not enable us to draw samples from p(z) directly).
  - Goal: evaluate  \mathbb{E}[f] = \int f(z)\, p(z)\, dz
• Idea
  - Use a proposal distribution q(z) from which it is easy to sample.
  - Express the expectation in the form of a finite sum over samples {z^{(l)}} drawn from q(z):
      \mathbb{E}[f] \simeq \frac{1}{L} \sum_{l=1}^{L} \frac{p(z^{(l)})}{q(z^{(l)})}\, f(z^{(l)})
    where the ratios p(z^{(l)})/q(z^{(l)}) are the importance weights.
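A minimal sketch of the importance-sampling estimator, assuming a Gaussian target p = N(1, 1) and a wider Gaussian proposal q = N(0, 2^2), both chosen only for illustration:

```python
# Estimate E_p[f(z)] from samples of q(z), reweighted by p(z^(l)) / q(z^(l)).
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def f(z):
    return z ** 2                        # E_p[z^2] = 1^2 + 1 = 2 for p = N(1, 1)

L = 50000
z = rng.normal(0.0, 2.0, size=L)         # samples drawn from the proposal q(z)
w = gauss_pdf(z, 1.0, 1.0) / gauss_pdf(z, 0.0, 2.0)   # importance weights
print(np.mean(w * f(z)))                 # close to the true value 2.0
```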
Recap: MCMC – Markov Chain Monte Carlo
• Overview
  - Allows sampling from a large class of distributions.
  - Scales well with the dimensionality of the sample space.
• Idea
  - We maintain a record of the current state z^{(\tau)}.
  - The proposal distribution depends on the current state: q(z|z^{(\tau)}).
  - The sequence of samples forms a Markov chain z^{(1)}, z^{(2)}, …
• Approach
  - At each time step, we generate a candidate sample from the proposal distribution and accept the sample according to a criterion.
  - Different variants of MCMC correspond to different acceptance criteria.
Recap: MCMC – Metropolis Algorithm
• Metropolis algorithm [Metropolis et al., 1953]
  - The proposal distribution is symmetric: q(z_A|z_B) = q(z_B|z_A).
  - The new candidate sample z^\star is accepted with probability
      A(z^\star, z^{(\tau)}) = \min\left(1, \frac{\tilde{p}(z^\star)}{\tilde{p}(z^{(\tau)})}\right)
  - New candidate samples are always accepted if \tilde{p}(z^\star) \ge \tilde{p}(z^{(\tau)}).
  - The algorithm sometimes accepts a state with lower probability.
• Metropolis-Hastings algorithm
  - Generalization: the proposal distribution is not necessarily symmetric.
  - The new candidate sample z^\star is accepted with probability
      A(z^\star, z^{(\tau)}) = \min\left(1, \frac{\tilde{p}(z^\star)\, q_k(z^{(\tau)}|z^\star)}{\tilde{p}(z^{(\tau)})\, q_k(z^\star|z^{(\tau)})}\right)
    where k labels the members of the set of considered transitions.
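A minimal sketch of the Metropolis algorithm with a symmetric Gaussian random-walk proposal; the unnormalized target \tilde{p}(z) and the step size are illustrative assumptions:

```python
# Metropolis sampling: accept z* with probability min(1, p~(z*) / p~(z^(tau))).
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(z):                               # unnormalized target distribution
    return np.exp(-0.5 * (z - 1.5) ** 2) + 0.5 * np.exp(-0.5 * (z + 1.0) ** 2)

n_steps, step_size = 20000, 1.0
z = 0.0                                       # initial state z^(0)
chain = []
for _ in range(n_steps):
    z_star = z + rng.normal(0.0, step_size)   # symmetric proposal q(z*|z)
    A = min(1.0, p_tilde(z_star) / p_tilde(z))  # acceptance probability
    if rng.uniform() < A:                     # accept with probability A
        z = z_star
    chain.append(z)                           # Markov chain z^(1), z^(2), ...
```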
Recap: Gibbs Sampling
• Approach
  - An MCMC algorithm that is simple and widely applicable.
  - May be seen as a special case of Metropolis-Hastings.
• Idea
  - Sample variable-wise: replace z_i by a value drawn from the distribution p(z_i | z_{\setminus i}).
    – This means we update one coordinate at a time.
  - Repeat the procedure either by cycling through all variables or by choosing the next variable to update.
• Properties
  - The algorithm always accepts!
  - Completely parameter-free.
  - Can also be applied to subsets of variables.
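A minimal sketch of Gibbs sampling for a bivariate Gaussian target, where both conditionals are themselves Gaussian; the target and its correlation rho are assumptions chosen for illustration:

```python
# Gibbs sampling for (z1, z2) ~ N(0, [[1, rho], [rho, 1]]): update one coordinate at a time.
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8

z1, z2 = 0.0, 0.0
chain = []
for _ in range(20000):
    z1 = rng.normal(rho * z2, np.sqrt(1.0 - rho ** 2))   # draw from p(z1 | z2)
    z2 = rng.normal(rho * z1, np.sqrt(1.0 - rho ** 2))   # draw from p(z2 | z1)
    chain.append((z1, z2))

print(np.corrcoef(np.array(chain).T)[0, 1])   # close to rho; every proposal is accepted
```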
Topics of This Lecture
• Regression
  - Least-squares regression
  - Polynomial regression
  - Overfitting
  - Maximum-likelihood regression
• Gaussian Processes: Weight-Space View
  - Linear model
  - MAP estimate
  - Prediction
  - Non-linear model
• Gaussian Processes: Function-Space View
  - Definition
  - Prediction with noise-free observations
  - Prediction with noisy observations
From Classification to Regression
• We will leave the realm of classification and turn to a different task…
• Regression
  - Predict a continuous function value.
  - Example (figures from Bishop, 2006): polynomials of increasing order fitted to the same data.
    – Polynomial of order 0 (constant value)
    – Polynomial of order 1 (line)
    – Polynomial of order 2 (quadratic)
    – Polynomial of order 9: massive overfitting!
From Classification to Regression
• In 2-class classification with a discriminant function, our goal was to find a function y(x) such that
  - y(x) > 0 for all data points in class 1
  - y(x) < 0 for all data points in class -1
• Regression
  - Given training data {(x_1, y_1),…,(x_n, y_n)}, where x_i is a training point with desired function value y_i.
  - We want to find a function  y: \mathbb{R}^d \to \mathbb{R}, \; x \mapsto y(x).
  - It should fit the training data well, but also generalize!
From Classification to Regression
• Regression
  - Find a function  y: \mathbb{R}^d \to \mathbb{R}, \; x \mapsto y(x).
  - This is a generalization of binary classification to arbitrary real values.
  - It suggests that some of our classification methods might be adapted to this situation.
• First things first: least-squares
Least-Squares Regression
• We are given
  - Training data points:        X = \{x_1 \in \mathbb{R}^d, \ldots, x_n\}
  - Associated function values:  Y = \{y_1 \in \mathbb{R}, \ldots, y_n\}
• Start with a linear regressor:
  - Try to enforce  x_i^T w + w_0 = y_i, \quad \forall i = 1, \ldots, n
  - One linear equation for each training data point / label pair.
  - This is the same basic setup used for least-squares classification!
    – Only the target values are now continuous.
Least-Squares Regression
• Setup (starting from  x_i^T w + w_0 = y_i, \; \forall i = 1, \ldots, n); a short numerical sketch follows below.
  - Step 1: Define
      \tilde{x}_i = \begin{pmatrix} x_i \\ 1 \end{pmatrix}, \qquad \tilde{w} = \begin{pmatrix} w \\ w_0 \end{pmatrix}
  - Step 2: Rewrite
      \tilde{x}_i^T \tilde{w} = y_i, \quad \forall i = 1, \ldots, n
  - Step 3: Matrix-vector notation
      \tilde{X}^T \tilde{w} = y, \quad \text{with} \quad \tilde{X} = [\tilde{x}_1, \ldots, \tilde{x}_n], \quad y = [y_1, \ldots, y_n]^T
  - Step 4: Find the least-squares solution
      \|\tilde{X}^T \tilde{w} - y\|^2 \to \min
  - Solution:
      \tilde{w} = (\tilde{X}\tilde{X}^T)^{-1}\tilde{X} y
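A minimal NumPy sketch of this solution using the augmented data matrix \tilde{X}; the synthetic data and the true weights are assumptions made for illustration (np.linalg.solve is used instead of an explicit matrix inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(d, n))                      # training points x_i as columns
w_true, w0_true = np.array([1.0, -2.0, 0.5]), 0.3
y = X.T @ w_true + w0_true + 0.01 * rng.normal(size=n)

X_tilde = np.vstack([X, np.ones((1, n))])        # columns x~_i = (x_i; 1)
w_tilde = np.linalg.solve(X_tilde @ X_tilde.T, X_tilde @ y)   # (X~ X~^T)^{-1} X~ y
print(w_tilde)                                   # approx. [1.0, -2.0, 0.5, 0.3]
```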
Polynomial Regression
• How can we fit arbitrary polynomials using least-squares regression?
  - We introduce a feature transformation, as before:
      y(x) = w^T \phi(x) = \sum_{i=0}^{M} w_i\, \phi_i(x), \qquad \text{assume } \phi_0(x) = 1
    where the \phi_i(x) are basis functions.
  - E.g. \phi(x) = (1, x, x^2, x^3)^T fits a cubic polynomial (see the sketch below).
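A minimal sketch of fitting a cubic polynomial with the basis \phi(x) = (1, x, x^2, x^3)^T; the noisy sine data are an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.normal(size=20)

Phi = np.vander(x, N=4, increasing=True)      # rows phi(x_i) = (1, x, x^2, x^3)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least-squares weights
print(w)                                      # cubic-fit coefficients w_0 .. w_3
```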
Overfitting
• Example: polynomial of degree 9 (figures from Bishop, 2006)
  - With relatively little data, overfitting is typical.
  - With enough data, we obtain a good estimate.
What Is Happening Here?
• The coefficients get very large:
  - Fitting the data from before with polynomials of various degrees.
  - Coefficients: (table from Bishop, 2006, not reproduced here)
What Is Happening Here?
• Obvious problems
  - Overfitting
  - Numerical instability
• How can we address these in a principled way?
  - We use the probabilistic framework that we have been using throughout the lecture.
• First step:
  - Least-squares regression as maximum likelihood estimation.
Probabilistic Regression
• First assumption:
  - Our target function values y are generated by adding noise to the function estimate:
      y = f(x; w) + \epsilon
    where y is the target function value, f is the regression function (previously y(\cdot)), x is the input value, w are the weights or parameters, and \epsilon is the noise.
• Second assumption:
  - The noise is Gaussian distributed:
      p(y|x, w, \beta) = \mathcal{N}(y \,|\, f(x; w), \beta^{-1})
    with mean f(x; w) and variance \beta^{-1} (\beta is the precision).
Probabilistic Regression
• Given
  - Training data points:        X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}
  - Associated function values:  y = [y_1, \ldots, y_n]^T
• Conditional likelihood (assuming i.i.d. data)
      p(y|X, w, \beta) = \prod_{i=1}^{n} \mathcal{N}(y_i \,|\, f(x_i; w), \beta^{-1}) = \prod_{i=1}^{n} \mathcal{N}(y_i \,|\, w^T\phi(x_i), \beta^{-1})
    with the generalized linear regression function w^T\phi(x_i).
  - Maximize w.r.t. w and \beta.
Maximum Likelihood Regression
• Simplify the log-likelihood
      \log p(y|X, w, \beta) = \sum_{i=1}^{n} \log \mathcal{N}(y_i \,|\, w^T\phi(x_i), \beta^{-1})
                            = \sum_{i=1}^{n} \left[ \log\frac{\sqrt{\beta}}{\sqrt{2\pi}} - \frac{\beta}{2}\left(y_i - w^T\phi(x_i)\right)^2 \right]
                            = \frac{n}{2}\log\beta - \frac{n}{2}\log(2\pi) - \frac{\beta}{2}\sum_{i=1}^{n}\left(y_i - w^T\phi(x_i)\right)^2
• Gradient w.r.t. w:
      \nabla_w \log p(y|X, w, \beta) = \beta \sum_{i=1}^{n} \left(y_i - w^T\phi(x_i)\right)\phi(x_i)
Maximum Likelihood Regression
      \nabla_w \log p(y|X, w, \beta) = \beta \sum_{i=1}^{n} \left(y_i - w^T\phi(x_i)\right)\phi(x_i)
• Setting the gradient to zero:
      0 = \beta \sum_{i=1}^{n} \left(y_i - w^T\phi(x_i)\right)\phi(x_i)
      \Leftrightarrow \quad \sum_{i=1}^{n} y_i\, \phi(x_i) = \left[\sum_{i=1}^{n} \phi(x_i)\phi(x_i)^T\right] w
      \Leftrightarrow \quad \Phi y = \Phi\Phi^T w
      \Leftrightarrow \quad w_{ML} = (\Phi\Phi^T)^{-1}\Phi y, \qquad \Phi = [\phi(x_1), \ldots, \phi(x_n)]
  - Same as in least-squares regression!
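A small numerical check (an illustrative sketch, not from the slides) that the maximum-likelihood solution from the normal equations coincides with a direct least-squares fit; the toy data and basis functions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.normal(size=30)

Phi = np.vander(x, N=4, increasing=True).T            # Phi = [phi(x_1), ..., phi(x_n)]
w_ml = np.linalg.solve(Phi @ Phi.T, Phi @ y)          # w_ML from the normal equations
w_ls, *_ = np.linalg.lstsq(Phi.T, y, rcond=None)      # direct least-squares fit
print(np.allclose(w_ml, w_ls))                        # True
```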
Regression – Two Common Approaches
• Restrict class of functions that we consider
  - E.g. linear functions of the input, or non-linear functions \phi(x).
  - Solve with least-squares fitting or an ML (maximum likelihood) estimate (see the previous slides).
• Bayesian modeling
  - Place a prior probability over all possible functions, p(f).
  - f: possible functions, X: input data, y: output.
  - Calculate the MAP (maximum a posteriori) estimate:
      p(f|y, X) = \frac{p(y|X, f)\, p(f)}{p(y|X)}
    with the likelihood of the observations p(y|X, f), the prior over functions p(f), and the normalization p(y|X).
Visualization of Bayesian Modeling
• 1D example (figures from Rasmussen & Williams, 2006):
• Left:
  - Visualization of the prior p(f) (defined by a Gaussian process):
    – 4 function samples are drawn from the prior distribution.
  - The point-wise mean is zero.
  - Grey area: point-wise variance over all function values.
• Right:
  - Posterior given 2 observations (with no uncertainty in their values).
  - Grey area: point-wise variance of the remaining function values.
Topics of This Lecture
• Regression
  - Least-squares regression
  - Polynomial regression
  - Overfitting
  - Maximum-likelihood regression
• Gaussian Processes: Weight-Space View
  - Linear model
  - MAP estimate
  - Prediction
  - Non-linear model
• Gaussian Processes: Function-Space View
  - Definition
  - Prediction with noise-free observations
  - Prediction with noisy observations
Gaussian Process (Informal Introduction)
• Gaussian distribution
  - Probability distribution over scalars / vectors.
• Gaussian process (generalization of the Gaussian distribution)
  - Describes properties of functions.
  - Function: think of a function as a long vector where each entry specifies the function value f(x_i) at a particular point x_i.
  - Issue: how to deal with an infinite number of points…
    – If you ask only for properties of the function at a finite number of points…
    – Then inference in a Gaussian process gives you the same answer if you ignore the infinitely many other points.
Gaussian Process
• Example prior over functions p(f)
  - Represents our prior belief about functions before seeing any data.
  - Although specific sample functions don't have a mean of zero, the mean of the f(x) values for any fixed x is zero (here).
  - Favors smooth functions:
    – I.e. functions cannot vary too rapidly.
    – Smoothness is induced by the covariance function of the Gaussian process.
  - Learning in Gaussian processes is mainly a matter of finding suitable properties of the covariance function.
  (Figure from Rasmussen & Williams, 2006.)
Standard Linear Model
• Linear regression model with Gaussian noise
      f(x) = x^T w, \qquad y = f(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_n^2)
• Calculation of the likelihood
  - Given input data X = (x_1, \ldots, x_n)
  - With corresponding output values y_1, \ldots, y_n:
      p(y|w, X) = \prod_{i=1}^{n} p(y_i | x_i, w)
Linear Model – Likelihood
• Likelihood
      p(y|w, X) = \prod_{i=1}^{n} p(y_i | x_i, w)
                = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left\{ -\frac{(y_i - x_i^T w)^2}{2\sigma_n^2} \right\}
                = \frac{1}{(2\pi\sigma_n^2)^{n/2}} \exp\left\{ -\frac{|y - X^T w|^2}{2\sigma_n^2} \right\}
  - In short:  p(y|w, X) = \mathcal{N}(X^T w, \sigma_n^2 I)
Linear Model – MAP Estimate
• Inference in the Bayesian model
  - Calculation of the posterior distribution over the weights:
      p(w|y, X) = \frac{p(y|w, X)\, p(w, X)}{p(y, X)} = \frac{p(y|w, X)\, p(w|X)\, p(X)}{p(y|X)\, p(X)} = \frac{p(y|w, X)\, p(w|X)}{p(y|X)} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}
  - Likelihood:  p(y|w, X) = \mathcal{N}(X^T w, \sigma_n^2 I)
  - Prior, e.g.:  p(w) = \mathcal{N}(0, \Sigma_p)
  - Marginal likelihood (normalization constant):
      p(y|X) = \int p(y|w, X)\, p(w)\, dw
Linear Model – MAP Estimate
• Posterior
      p(w|y, X) \propto p(y|w, X)\, p(w|X)
                \propto \exp\left\{ -\frac{1}{2\sigma_n^2} (y - X^T w)^T (y - X^T w) \right\} \exp\left\{ -\frac{1}{2} w^T \Sigma_p^{-1} w \right\}
                \propto \exp\left\{ -\frac{1}{2} (w - \bar{w})^T \left( \frac{1}{\sigma_n^2} X X^T + \Sigma_p^{-1} \right) (w - \bar{w}) \right\}
  - with
      A = \frac{1}{\sigma_n^2} X X^T + \Sigma_p^{-1}, \qquad \bar{w} = \frac{1}{\sigma_n^2} A^{-1} X y
• MAP
  - The MAP estimate is simply the mean of p(w|y, X) = \mathcal{N}(\bar{w}, A^{-1}).
Linear Model: Predictions
• Predictions for a test case in the Bayesian model
  - Average over all possible parameter values, weighted by their posterior probability.
  - Non-Bayesian approach: a single parameter value is chosen by some criterion (e.g. ML).
• Predictive distribution for f_\star = f(x_\star) at a test point x_\star
  - Given by averaging over all possible models:
      p(f_\star | x_\star, y, X) = \int p(f_\star | x_\star, w)\, p(w|y, X)\, dw = \mathcal{N}\left( \frac{1}{\sigma_n^2} x_\star^T A^{-1} X y, \; x_\star^T A^{-1} x_\star \right)
Linear Model: Predictions
• Predictive distribution (a short numerical sketch follows below)
      p(f_\star | x_\star, y, X) = \mathcal{N}\left( \frac{1}{\sigma_n^2} x_\star^T A^{-1} X y, \; x_\star^T A^{-1} x_\star \right)
  - The predictive distribution is again Gaussian.
  - Mean:  x_\star^T \bar{w}
    – Uses the MAP estimate of the weight vector, \bar{w} = \frac{1}{\sigma_n^2} A^{-1} X y.
  - Variance:  x_\star^T A^{-1} x_\star
    – A quadratic form of the test input x_\star with the posterior covariance matrix A^{-1}.
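A minimal sketch of the weight-space view above: compute A, the posterior mean \bar{w}, and the Gaussian predictive distribution at a test input. The 1D data with an added bias feature, \sigma_n, and \Sigma_p are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 20
X = np.vstack([rng.uniform(-1.0, 1.0, size=n), np.ones(n)])   # inputs as columns (with bias feature)
w_true = np.array([1.5, -0.5])
sigma_n = 0.3
y = X.T @ w_true + sigma_n * rng.normal(size=n)

Sigma_p = np.eye(d)                                   # prior p(w) = N(0, Sigma_p)
A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)     # A = (1/sigma_n^2) X X^T + Sigma_p^{-1}
w_bar = np.linalg.solve(A, X @ y) / sigma_n**2        # posterior mean = MAP estimate

x_star = np.array([0.5, 1.0])                         # test input (with bias entry)
mean = x_star @ w_bar                                 # predictive mean  x*^T w_bar
var = x_star @ np.linalg.solve(A, x_star)             # predictive variance  x*^T A^{-1} x*
print(mean, var)
```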
Linear Model: Predictions
• 1D example: f(x) = w_1 + w_2 x
  - 3 training points (crosses): X = (x_1, x_2, x_3)
  - Assume noise in the points: \sigma_n = 1
  - Predictive mean (solid line) and predicted standard deviation (dotted lines).
  - Likelihood:  p(y|w, X) = \mathcal{N}(X^T w, \sigma_n^2 I)
  - Prior:       p(w) = \mathcal{N}(0, I)
  - Posterior:   p(w|y, X) = \mathcal{N}(\bar{w}, A^{-1})
  (Figure from Rasmussen & Williams, 2006.)
Non-Linear Model
• Map D-dimensional x into N-dimensional feature space:
      x \to \phi(x)
• Linear regression in the N-dimensional feature space:
      f(x) = \phi(x)^T w
• Non-linear model:
  - The previous analysis applies analogously by replacing X with \Phi(X) = (\phi(x_1), \ldots, \phi(x_n)):
      p(f_\star | x_\star, y, X) = \mathcal{N}\left( \frac{1}{\sigma_n^2} \phi(x_\star)^T A^{-1} \Phi(X) y, \; \phi(x_\star)^T A^{-1} \phi(x_\star) \right)
  - with
      A = \frac{1}{\sigma_n^2} \Phi(X)\Phi(X)^T + \Sigma_p^{-1}
Topics of This Lecture
• Regression
  - Least-squares regression
  - Polynomial regression
  - Overfitting
  - Maximum-likelihood regression
• Gaussian Processes: Weight-Space View
  - Linear model
  - MAP estimate
  - Prediction
  - Non-linear model
• Gaussian Processes: Function-Space View
  - Definition
  - Prediction with noise-free observations
  - Prediction with noisy observations
Function Space View
• Function space view
  - Derive the above results by performing inference in function space directly.
  - We use a Gaussian process to describe a distribution over functions.
• Definition
  - A Gaussian process (GP) is a collection of random variables, any finite number of which has a joint Gaussian distribution.
Gaussian Process
• A Gaussian process is completely defined by
  - Its mean function m(x):
      m(x) = \mathbb{E}[f(x)]
  - and its covariance function k(x, x'):
      k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]
  - We write the Gaussian process (GP) as
      f(x) \sim \mathcal{GP}(m(x), k(x, x'))
Gaussian Process
• Property
  - A GP is defined as a collection of random variables, which implies consistency.
  - Consistency means:
    – If the GP specifies e.g. (y_1, y_2) \sim \mathcal{N}(\mu, \Sigma) with \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},
    – then it must also specify y_1 \sim \mathcal{N}(\mu_1, \Sigma_{11}).
  - I.e. examination of a larger set of variables does not change the distribution of a smaller set.
Gaussian Process: Example
• Example:
  - Bayesian linear regression model: f(x) = \phi(x)^T w
  - With Gaussian prior: w \sim \mathcal{N}(0, \Sigma_p)
  - Mean:
      \mathbb{E}[f(x)] = \phi(x)^T \mathbb{E}[w] = 0
  - Covariance:
      \mathbb{E}[f(x)f(x')] = \phi(x)^T \mathbb{E}[w w^T]\, \phi(x') = \phi(x)^T \Sigma_p\, \phi(x')
Gaussian Process: Squared Exponential
• Typical covariance function
  - Squared exponential (SE)
    – The covariance function specifies the covariance between pairs of random variables:
      \mathrm{cov}(f(x_p), f(x_q)) = k(x_p, x_q) = \exp\left\{ -\frac{1}{2} |x_p - x_q|^2 \right\}
• Remarks
  - The covariance between the outputs is written as a function of the inputs.
  - The squared exponential covariance function corresponds to a Bayesian linear regression model with an infinite number of basis functions.
  - For any positive definite covariance function k(\cdot, \cdot), there exists a (possibly infinite) expansion in terms of basis functions.
Gaussian Process: Prior over Functions
• Distribution over functions:
  - Specifying the covariance function implies a distribution over functions.
  - I.e. we can draw samples from the distribution of functions evaluated at a (finite) number of points.
  - Procedure (a code sketch follows below):
    – We choose a number of input points X_\star.
    – We write the corresponding covariance matrix (e.g. using the SE kernel) element-wise: K(X_\star, X_\star).
    – Then we generate a random Gaussian vector with this covariance matrix:
      f_\star \sim \mathcal{N}(0, K(X_\star, X_\star))
  (Figure from Rasmussen & Williams, 2006: example of 3 sampled functions.)
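A minimal sketch of this sampling procedure with the squared-exponential kernel; the input grid, the number of sampled functions, and the small jitter term added for numerical stability are implementation assumptions, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def k_se(xp, xq):
    # squared exponential covariance: exp(-0.5 * |x_p - x_q|^2)
    return np.exp(-0.5 * (xp[:, None] - xq[None, :]) ** 2)

X_star = np.linspace(-5.0, 5.0, 100)            # chosen input points X*
K = k_se(X_star, X_star)                        # covariance matrix K(X*, X*)
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(X_star)))   # jitter for stability
f_star = L @ rng.normal(size=(len(X_star), 3))  # 3 sampled functions, each ~ N(0, K(X*, X*))
```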
Prediction with Noise-free Observations
• Assume our observations are noise-free:
      \{(x_i, f_i) \,|\, i = 1, \ldots, n\}
• Joint distribution of the training outputs f and test outputs f_\star according to the prior:
      \begin{bmatrix} f \\ f_\star \end{bmatrix} \sim \mathcal{N}\left( 0, \begin{bmatrix} K(X, X) & K(X, X_\star) \\ K(X_\star, X) & K(X_\star, X_\star) \end{bmatrix} \right)
  - K(X, X_\star) contains the covariances for all pairs of training and test points.
• To get the posterior (after including the observations)
  - We need to restrict the above prior to contain only those functions which agree with the observed values.
  - Think of generating functions from the prior and rejecting those that disagree with the observations (obviously prohibitively expensive).
Prediction with Noise-free Observations
• Calculation of posterior
  - Corresponds to conditioning the joint Gaussian prior distribution on the observations (a code sketch follows below):
      f_\star | X_\star, X, f \sim \mathcal{N}(\bar{f}_\star, \mathrm{cov}(f_\star))
  - with:
      \bar{f}_\star = K(X_\star, X)\, K(X, X)^{-1} f
      \mathrm{cov}(f_\star) = K(X_\star, X_\star) - K(X_\star, X)\, K(X, X)^{-1} K(X, X_\star)
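A minimal sketch of these two equations; the training inputs, the noise-free observations f_i = sin(x_i), the SE kernel from above, and the jitter term are assumptions chosen for illustration:

```python
import numpy as np

def k_se(xp, xq):
    return np.exp(-0.5 * (xp[:, None] - xq[None, :]) ** 2)

X = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])       # training inputs
f = np.sin(X)                                   # noise-free observations f_i = f(x_i)
X_star = np.linspace(-5.0, 5.0, 100)            # test inputs X*

K = k_se(X, X) + 1e-10 * np.eye(len(X))         # K(X, X) with a small jitter term
K_s = k_se(X_star, X)                           # K(X*, X)
f_bar = K_s @ np.linalg.solve(K, f)             # posterior mean  K(X*,X) K(X,X)^{-1} f
cov = k_se(X_star, X_star) - K_s @ np.linalg.solve(K, K_s.T)   # posterior covariance
```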
Prediction with Noise-free Observations
• Example:
  - Left: prior.
  - Right: posterior using 5 noise-free observations.
  (Figures from Rasmussen & Williams, 2006.)
References and Further Reading
• Gaussian processes are briefly described in Chapter 6.4 of Bishop's book.
  - Christopher M. Bishop: Pattern Recognition and Machine Learning. Springer, 2006.
  - Carl E. Rasmussen, Christopher K.I. Williams: Gaussian Processes for Machine Learning. MIT Press, 2006.
• A better introduction can be found in Chapters 1 and 2 of the book by Rasmussen & Williams (also available online: http://www.gaussianprocess.org/gpml/).