Optimization of
Machine Learning Hyperparameters
Dr. Frank Hutter
Head of Emmy Noether Research Group on
Learning, Optimization, and Automated Algorithm Design
Computer Science Institute
University of Freiburg, Germany
July 2014
Recap: Bayesian optimization
2
Today’s Learning Goals
• After today's lecture, you can …
– Derive the principles behind Bayesian optimization
– Derive Bayesian linear regression
– Derive the posterior distribution of Gaussian processes
– Explain various acquisition functions
– Discuss some limitations of GP-based Bayesian optimization
– Explain regression trees and random forests
3
Outline of Today’s Class
• Bayesian optimization
– Bayesian models:
Bayesian linear regression & Gaussian processes
– Acquisition functions
• Extensions of Bayesian optimization for AutoML
– Bayesian optimization with random forests
– Applications
4
Bayes rule in Bayesian optimization
• Denote the observed data as D1:t = {(x1, y1), …, (xt, yt)}
• Denote our prior over functions as p(f)
• Then the posterior over functions is:
p(f ∣ D1:t) ∝ p(D1:t ∣ f) · p(f)
(posterior ∝ likelihood × prior)
5
Bayesian linear regression & Gaussian processes
• Acknowledgement:
The following slides are taken from Philipp Hennig's
tutorial on Gaussian processes at the Machine Learning
Summer School 2013
• All of Philipp's slides are online:
http://mlss.tuebingen.mpg.de/hennig_slides1.pdf
• Philipp's website also has video lectures and more slides:
http://www.is.tuebingen.mpg.de/nc/employee/details/phennig.html
6
Carl Friedrich Gauss (1777–1855)
Paying Tolls with A Bell
f(x) = 1/(σ√(2π)) · exp(−(x − µ)² / (2σ²))
2
The Gaussian distribution
Multivariate Form
N(x; µ, Σ) = (2π)^(−N/2) ∣Σ∣^(−1/2) exp[−½ (x − µ)⊺ Σ⁻¹ (x − µ)]
▸ x, µ ∈ R^N, Σ ∈ R^(N×N)
▸ Σ is positive semidefinite, i.e.
▸ v⊺ Σ v ≥ 0 for all v ∈ R^N
▸ symmetric (Hermitian), all eigenvalues ≥ 0
[Figure: contour plot of a two-dimensional Gaussian density.]
3
Why Gaussian?
an experiment
[Figure: histogram of measurement data on the interval [−0.1, 0.1].]
▸ nothing in the real world is Gaussian (except sums of i.i.d. variables)
▸ But nothing in the real world is linear either!
Gaussians are for inference what linear maps are for algebra.
4
Closure Under Multiplication
multiple Gaussian factors form a Gaussian
N (x; a, A)N (x; b, B) = N (x; c, C)N (a; b, A + B)
C := (A⁻¹ + B⁻¹)⁻¹
c := C(A⁻¹ a + B⁻¹ b)
[Figure: contours of the two Gaussian factors and of their product.]
5
Closure under Linear Maps
Linear Maps of Gaussians are Gaussians
p(z) = N(z; µ, Σ)  ⇒  p(Az) = N(Az; Aµ, AΣA⊺)
Here: A = [1, −0.5]
[Figure: a two-dimensional Gaussian and its image under the linear map A.]
6
Closure under Marginalization
projections of Gaussians are Gaussian
▸ projection with A = (1  0):
∫ N[(x; y); (µx; µy), (Σxx, Σxy; Σyx, Σyy)] dy = N(x; µx, Σxx)
▸ this is the sum rule: ∫ p(x, y) dy = ∫ p(y ∣ x) p(x) dy = p(x)
▸ so every finite-dimensional Gaussian is a marginal of infinitely many more
[Figure: a joint two-dimensional Gaussian and its one-dimensional marginal.]
7
Closure under Conditioning
cuts through Gaussians are Gaussians
p(x ∣ y) = p(x, y) / p(y) = N(x; µx + Σxy Σyy⁻¹ (y − µy), Σxx − Σxy Σyy⁻¹ Σyx)
▸ this is the product rule
▸ so Gaussians are closed under the rules of probability
[Figure: conditioning a two-dimensional Gaussian on an observed value of y.]
8
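The conditioning identity above is the engine behind every Gaussian-process posterior in the rest of the lecture, so a quick numerical check can help build intuition. The following MATLAB/Octave sketch is not from the slides; the mean, covariance and observed value are made up for illustration. It evaluates the analytic conditional and compares it with samples from the joint distribution.

% Sketch: conditioning a 2-D Gaussian, checked by sampling (illustrative values).
mu    = [1; 2];                                 % [mu_x; mu_y]
Sigma = [2.0, 0.8; 0.8, 1.0];                   % [Sxx Sxy; Syx Syy]
y_obs = 3;                                      % observed value of y

% analytic conditional, as on the slide
mu_cond  = mu(1) + Sigma(1,2) / Sigma(2,2) * (y_obs - mu(2));
var_cond = Sigma(1,1) - Sigma(1,2) / Sigma(2,2) * Sigma(2,1);

% empirical check: draw joint samples (same chol trick as in the slide code below),
% keep those whose y lands near y_obs
S    = bsxfun(@plus, mu, chol(Sigma)' * randn(2, 1e6))';
keep = abs(S(:,2) - y_obs) < 0.05;
fprintf('analytic : mean %.3f, var %.3f\n', mu_cond, var_cond);
fprintf('empirical: mean %.3f, var %.3f\n', mean(S(keep,1)), var(S(keep,1)));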
What can we do with this?
linear regression
given y ∈ RN , p(y ∣ f ), what’s f ?
[Figure: scatter plot of observations y over inputs x ∈ [−8, 8].]
10
A prior
over linear functions
f(x) = w1 + w2·x = φx⊺ w,   φx = (1; x)
p(w) = N(w; µ, Σ)
p(f) = N(f; φx⊺ µ, φx⊺ Σ φx)
[Figure: samples of w from the prior (left) and the corresponding linear functions f (right).]
11
The posterior
over linear functions
p(y ∣ w, φX) = N(y; φX⊺ w, σ² I)
p(w ∣ y, φX) = N(w; µ + ΣφX (φX⊺ ΣφX + σ² I)⁻¹ (y − φX⊺ µ),
                    Σ − ΣφX (φX⊺ ΣφX + σ² I)⁻¹ φX⊺ Σ)
[Figure: posterior over w (left) and the induced posterior over linear functions (right).]
13
The posterior
over linear functions
p(y ∣ w, φX) = N(y; φX⊺ w, σ² I)
p(fx ∣ y, φX) = N(fx; φx⊺ µ + φx⊺ ΣφX (φX⊺ ΣφX + σ² I)⁻¹ (y − φX⊺ µ),
                      φx⊺ Σφx − φx⊺ ΣφX (φX⊺ ΣφX + σ² I)⁻¹ φX⊺ Σφx)
[Figure: posterior mean, uncertainty band and samples of f, with the data overlaid.]
13
% prior on w
F     = 2;                              % number of features
phi   = @(a)(bsxfun(@power,a,0:F-1));   % phi(a) = [1; a]
mu    = zeros(F,1);
Sigma = eye(F);                         % p(w) = N(mu, Sigma)

% prior on f(x)
n     = 100; x = linspace(-6,6,n)';     % 'test' points
phix  = phi(x);                         % features of x
m     = phix * mu;
kxx   = phix * Sigma * phix';           % p(fx) = N(m, kxx)
s     = bsxfun(@plus,m,chol(kxx + 1.0e-8 * eye(n))' * randn(n,3)); % samples from prior
stdpi = sqrt(diag(kxx));                % marginal stddev, for plotting

load('data.mat'); N = length(Y);        % gives Y, X, sigma
% prior on Y = fX + noise
phiX  = phi(X);                         % features of data
M     = phiX * mu;
kXX   = phiX * Sigma * phiX';           % p(fX) = N(M, kXX)
G     = kXX + sigma^2 * eye(N);         % p(Y)  = N(M, kXX + sigma^2 I)
R     = chol(G);                        % most expensive step: O(N^3)
kxX   = phix * Sigma * phiX';           % cov(fx, fX) = kxX
A     = kxX / R;                        % pre-compute for re-use

mpost = m + A * (R' \ (Y-M));           % p(fx | Y) = N(m + kxX (kXX + sigma^2 I)^-1 (Y - M),
vpost = kxx - A * A';                   %              kxx - kxX (kXX + sigma^2 I)^-1 kXx)
spost = bsxfun(@plus,mpost,chol(vpost + 1.0e-8 * eye(n))' * randn(n,3)); % samples from posterior
stdpo = sqrt(diag(vpost));              % marginal stddev, for plotting
14
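The script above loads a file data.mat providing Y, X and the noise level sigma, which is not part of this transcript. A minimal stand-in can be generated as follows (purely illustrative; the data set used in the original tutorial is different):

% Sketch: create a stand-in data.mat so the script above runs end-to-end.
X     = 12 * rand(20,1) - 6;                    % 20 inputs in [-6, 6]
sigma = 1.0;                                    % observation noise stddev
Y     = 2 + 1.5 * X + sigma * randn(size(X));   % noisy samples of a linear function
save('data.mat','X','Y','sigma');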
A More Realistic Dataset
General Linear Regression
f(x) = φx⊺ w
[Figure: a more complex data set y over x ∈ [−8, 8]; a straight line clearly cannot fit it.]
15
f(x) = w1 + w2·x = φx⊺ w,   φx := (1; x)
[Figure: the linear model from before applied to the new data set.]
16
[The code from slide 14 is shown again here, unchanged.]
17
Cubic Regression
phi = @(a)(bsxfun(@power,a,[0:3]));
f(x) = φ(x)⊺ w,   φ(x) = (1  x  x²  x³)⊺
[Figure: posterior mean, uncertainty band and samples for the cubic feature model.]
18
Septic Regression ?
phi = @(a)(bsxfun(@power,a,[0:7]));
f(x) = φ(x)⊺ w,   φ(x) = (1  x  x²  ⋯  x⁷)⊺
[Figure: posterior for the degree-7 polynomial feature model.]
19
Fourier Regression
phi = @(a)(2 * [cos(bsxfun(@times,a/8,[0:8])), sin(bsxfun(@times,a/8,[1:8]))]);
f(x) = φ(x)⊺ w,   φ(x) = (cos(x)  cos(2x)  cos(3x)  …  sin(x)  sin(2x)  …)⊺
[Figure: posterior for the Fourier feature model.]
20
Step Regression
phi = @(a)(-1 + 2 * bsxfun(@lt,a,linspace(-8,8,16)));
φ(x) = −1 + 2 (θ(x − 8)  θ(8 − x)  θ(x − 7)  θ(7 − x)  …)⊺
[Figure: posterior for the step-function feature model.]
21
V Regression
phi = @(a)(bsxfun(@minus,abs(bsxfun(@minus,a,linspace(-8,8,16))),linspace(-8,8,16)));
φ(x) = (∣x − 8∣ + 8  ∣x − 7∣ + 7  ∣x − 6∣ + 6  …)⊺
[Figure: posterior for the V-shaped (absolute value) feature model.]
23
Eiffel Tower Regression
phi = @(a)(exp(-abs(bsxfun(@minus,a,[-8:1:8]))));
φ(x) = (e^(−∣x−8∣)  e^(−∣x−7∣)  e^(−∣x−6∣)  …)⊺
[Figure: posterior for the "Eiffel tower" feature model.]
25
Bell Curve Regression
phi = @(a)(exp(-0.5 * bsxfun(@minus,a,[-8:1:8]).^2));
φ(x) = (e^(−½(x−8)²)  e^(−½(x−7)²)  e^(−½(x−6)²)  …)⊺
[Figure: posterior for the bell-curve (Gaussian bump) feature model.]
26
Multiple Inputs
all this works in multiple dimensions, too:   φ : R^N → R,   f : R^N → R
[Figure: the same regression machinery applied to multi-dimensional inputs.]
27
How many features should we use?
let’s look at that algebra again
p(fx ∣ y, φX) = N(fx; φx⊺ µ + φx⊺ ΣφX (φX⊺ ΣφX + σ² I)⁻¹ (y − φX⊺ µ),
                      φx⊺ Σφx − φx⊺ ΣφX (φX⊺ ΣφX + σ² I)⁻¹ φX⊺ Σφx)
▸ there's no lonely φ in there
▸ all objects involving φ are of the form
▸ φ⊺ µ — the mean function
▸ φ⊺ Σ φ — the kernel
▸ once these are known, cost is independent of the number of features
▸ remember the code:
M     = phiX * mu;
m     = phix * mu;
kXX   = phiX * Sigma * phiX';    % p(fX) = N(M, kXX)
kxx   = phix * Sigma * phix';    % p(fx) = N(m, kxx)
kxX   = phix * Sigma * phiX';    % cov(fx, fX) = kxX
32
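To see this concretely, the covariance computed from explicit features and the covariance computed from the induced kernel k(a, b) = φ(a) Σ φ(b)⊺ agree exactly. A small sketch (not from the slides; variable names follow the code above):

% Sketch: explicit features vs. kernel form give the same covariance matrix.
F     = 2;
phi   = @(a)(bsxfun(@power,a,0:F-1));       % phi(a) = [1, a]
Sigma = eye(F);
k     = @(a,b)(phi(a) * Sigma * phi(b)');   % kernel induced by the features

X = linspace(-6,6,5)';
kXX_features = phi(X) * Sigma * phi(X)';
kXX_kernel   = k(X,X);
max(abs(kXX_features(:) - kXX_kernel(:)))   % 0, up to round-off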
[The code from slide 14 is shown again here, unchanged, for reference.]
33
% prior
F     = 2;                                % number of features
phi   = @(a)(bsxfun(@power,a,0:F-1));     % phi(a) = [1; a]
k     = @(a,b)(phi(a) * phi(b)');         % kernel
mu    = @(a)(zeros(size(a,1),1));         % mean function

% belief on f(x)
n     = 100; x = linspace(-6,6,n)';       % 'test' points
m     = mu(x);
kxx   = k(x,x);                           % p(fx) = N(m, kxx)
s     = bsxfun(@plus,m,chol(kxx + 1.0e-8 * eye(n))' * randn(n,3)); % samples from prior
stdpi = sqrt(diag(kxx));                  % marginal stddev, for plotting

load('data.mat'); N = length(Y);          % gives Y, X, sigma
% prior on Y = fX + noise
M     = mu(X);
kXX   = k(X,X);                           % p(fX) = N(M, kXX)
G     = kXX + sigma^2 * eye(N);           % p(Y)  = N(M, kXX + sigma^2 I)
R     = chol(G);                          % most expensive step: O(N^3)
kxX   = k(x,X);                           % cov(fx, fX) = kxX
A     = kxX / R;                          % pre-compute for re-use

mpost = m + A * (R' \ (Y-M));             % p(fx | Y) = N(m + kxX (kXX + sigma^2 I)^-1 (Y - M),
vpost = kxx - A * A';                     %              kxx - kxX (kXX + sigma^2 I)^-1 kXx)
spost = bsxfun(@plus,mpost,chol(vpost + 1.0e-8 * eye(n))' * randn(n,3)); % samples from posterior
stdpo = sqrt(diag(vpost));                % marginal stddev, for plotting
34
Exponentiated Squares
phi = @(a)(exp(-0.5 * bsxfun(@minus,a,linspace(-8,8,10)).^2 ./ell.^2));
[Figure: posterior with 10 exponentiated-square features of width ell.]
37
Exponentiated Squares
phi = @(a)(exp(-0.5 * bsxfun(@minus,a,linspace(-8,8,30)).^2 ./ell.^2));
[Figure: the same model with 30 exponentiated-square features.]
37
Exponentiated Squares
k = @(a,b)(5*exp(-0.25*bsxfun(@minus,a,b’).^2));
[Figure: posterior computed directly from the kernel, without explicit features.]
▸ a.k.a. radial basis function or squared-exponential kernel
37
What just happened?
Gaussian process priors
Definition
A function k : X × X → R is a Mercer kernel if, for any finite collection
X = [x1, …, xN], the matrix kXX ∈ R^(N×N) with elements
kXX,(i,j) = k(xi, xj) is positive semidefinite.
Definition
Let µ : X → R be any function and k : X × X → R be a Mercer kernel.
A Gaussian process p(f) = GP(f; µ, k) is a probability distribution over
the function f : X → R, such that every finite restriction to function values
fX := [fx1, …, fxN] is a Gaussian distribution p(fX) = N(fX; µX, kXX).
39
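A quick numerical sanity check of the Mercer property (a sketch, not from the slides): build kXX for the squared-exponential kernel at arbitrary inputs and confirm its smallest eigenvalue is non-negative up to round-off.

% Sketch: check positive semidefiniteness of a kernel matrix.
k   = @(a,b)(5*exp(-0.25*bsxfun(@minus,a,b').^2));  % squared-exponential kernel from the slide above
X   = randn(50,1);                                  % an arbitrary finite collection of inputs
kXX = k(X,X);
min(eig((kXX + kXX')/2))   % >= 0 up to numerical round-off, as required for a Mercer kernel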
Samples from a zero-mean Gaussian process
(Source: GPML, Figure 2.2)
• Samples from the prior are zero-mean, with values
drawn from a multivariate Gaussian distribution
• The kernel/covariance function k tells
us how correlated the function values at two points are
7
Kernels
• The dot-product kernel
• The squared exponential kernel
• There are many kernels!
– See, e.g. “Learning with kernels” by Bernhard Schölkopf, 1998
– Kernels over sentences, images, grammars, graphs, etc.
8
Kernels have hyperparameters
E.g., the length-scale hyperparameter ℓ in the squared exponential kernel
9
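As a concrete illustration (a sketch, not from the slides): the squared exponential kernel with explicit signal-variance and length-scale hyperparameters. Shrinking ℓ makes sampled functions wigglier; growing it makes them smoother.

% Sketch: squared-exponential kernel with hyperparameters (sigma_f, ell).
se_kernel = @(a,b,sigma_f,ell)(sigma_f^2 * exp(-0.5 * bsxfun(@minus,a,b').^2 ./ ell^2));

x = linspace(-6,6,100)';
for ell = [0.3, 1, 3]                                    % three length scales
    kxx = se_kernel(x,x,1,ell);
    s   = chol(kxx + 1.0e-8*eye(100))' * randn(100,1);   % one prior sample per length scale
    plot(x,s); hold on;
end
legend('ell = 0.3','ell = 1','ell = 3');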
Samples from the prior and the posterior
• The joint distribution of the observed function values f1:N
and a new function value f* is multivariate Gaussian
• Then, the posterior p(f* ∣ f1:N) is also a Gaussian
10
Exercise: derive the posterior
• The joint distribution of the observed function values f1:N
and a new function value f* is multivariate Gaussian
• Then, the posterior p(f* ∣ f1:N) is also a Gaussian
11
The predictive posterior distribution
The posterior Gaussian process has predictive distribution
p(f* ∣ f1:N) = N(f*; µ*, σ*²), where (for a zero-mean prior, with K(i,j) = k(xi, xj) and k*(i) = k(x*, xi))
µ*  = k*⊺ K⁻¹ f1:N
σ*² = k(x*, x*) − k*⊺ K⁻¹ k*
(derivation: exercise)
12
The predictive posterior under noise
The posterior Gaussian process has predictive distribution
p(f* ∣ y1:N) = N(f*; µ*, σ*²), where
µ*  = k*⊺ (K + σ² I)⁻¹ y1:N
σ*² = k(x*, x*) − k*⊺ (K + σ² I)⁻¹ k*
(derivation: exercise)
13
Computational complexity of GPs
• Let t denote the number of data points in the GP
• Fitting: inverting the kernel matrix costs time O(t³)
– In practice, we also need to optimize/sample the kernel
hyperparameters θ (such as the kernel's length scale)
– Evaluating k different values of θ: O(k · t³)
• Predictions of the variance: O(t²)
• Predictions of the mean: O(t)
(using the predictive distribution from the previous slide)
14
Outline of Today’s Class
• Bayesian optimization
– Bayesian models:
Bayesian linear regression & Gaussian processes
– Acquisition functions
• Extensions of Bayesian optimization for AutoML
– Bayesian optimization with random forests
– Applications
15
The acquisition function
• Given a posterior model p(f ∣ D1:t), which point should
we select next to find the maximum of f ?
• We want to trade off
– Exploitation (sampling where we expect
the function to be high)
vs.
– Exploration (sampling where we’re uncertain
about the function value)
• Various acquisition functions achieve this tradeoff
16
Probability of Improvement
17
Expected Improvement
(derivation of the closed-form solution: exercise)
18
Upper Confidence Bound (UCB)
19
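The formulas on these three slides did not survive the transcript. As a reference, here is a small MATLAB sketch (not from the slides) of the standard closed forms for a maximization problem, given the GP predictive mean mu(x) and standard deviation s(x); the exploration parameters xi and kappa, and the exact sign conventions, vary across papers.

% Sketch: standard acquisition functions for maximization.
normcdf_ = @(u) 0.5 * erfc(-u / sqrt(2));        % standard normal CDF (avoids toolbox dependency)
normpdf_ = @(u) exp(-0.5 * u.^2) / sqrt(2*pi);   % standard normal PDF

f_best = 0;      % best function value observed so far (placeholder)
xi     = 0.01;   % exploration parameter for PI/EI (assumed value)
kappa  = 2;      % exploration parameter for UCB (assumed value)

z   = @(mu,s) (mu - f_best - xi) ./ s;
PI  = @(mu,s) normcdf_(z(mu,s));                                                % Probability of Improvement
EI  = @(mu,s) (mu - f_best - xi) .* normcdf_(z(mu,s)) + s .* normpdf_(z(mu,s)); % Expected Improvement
UCB = @(mu,s) mu + kappa * s;                                                   % Upper Confidence Bound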
Entropy Search
[Hennig & Schuler, JMLR 2012]
• Compute a probability distribution
over which input x is optimal
• Acquisition function: try to push this probability
distribution as close to a delta distribution as possible
• One of the most powerful acquisition functions
– Can choose to actively evaluate in one region of the space to
learn something about a different region of the space
– Example application in hyperparameter optimization:
evaluate a hyperparameter setting based on a
subset of data points (cheap) to learn about
what is the best setting for all data
20
Putting it all Together
• How to optimize the acquisition function?
– Subsidiary optimization method
– Important: in that subsidiary optimization,
function evaluations are cheap (just predictions of the GP)
• How to update the GP?
– Subsidiary optimization method to set kernel hyperparameters
– Function evaluations are also cheap (just fitting the GP)
21
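Putting these pieces together, a minimal Bayesian optimization loop can be sketched as follows. This is not the lecture's reference implementation: it assumes a 1-D search interval, a toy objective, a fixed squared-exponential kernel (no hyperparameter updating), and optimizes a UCB acquisition function by simple grid search.

% Sketch: a minimal 1-D Bayesian optimization loop (illustrative only).
f      = @(x) -(x - 2).^2 + 3 + 0.1*randn(size(x));   % noisy toy objective (assumed)
k      = @(a,b)(exp(-0.5 * bsxfun(@minus,a,b').^2));  % squared-exponential kernel
sigma  = 0.1;                                         % observation noise stddev
kappa  = 2;                                           % UCB exploration parameter
xgrid  = linspace(-6,6,200)';                         % candidate points for the subsidiary search

X = [-4; 0; 4]; Y = f(X);                             % a few initial evaluations
for it = 1:20
    N    = length(Y);
    R    = chol(k(X,X) + sigma^2 * eye(N));           % O(N^3), as in the GP code earlier
    A    = k(xgrid,X) / R;
    mu   = A * (R' \ Y);                              % GP posterior mean on the grid
    s2   = ones(size(xgrid)) - sum(A.^2, 2);          % GP posterior variance (k(x,x) = 1 here)
    acq  = mu + kappa * sqrt(max(s2, 0));             % UCB acquisition function
    [~, i] = max(acq);                                % cheap subsidiary optimization
    xnew = xgrid(i);
    X = [X; xnew]; Y = [Y; f(xnew)];                  % one expensive function evaluation
end
[ybest, ibest] = max(Y); xbest = X(ibest);            % incumbent after the budget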
Summary of Bayesian Optimization
• Bayesian optimization integrates
– prior information and
– the likelihood of the observed data
• It uses quite involved computation to select which
function value to evaluate next
– Thus, it’s most useful for expensive blackbox functions
22
Outline of Today’s Class
• Bayesian optimization
– Bayesian models:
Bayesian linear regression & Gaussian processes
– Acquisition functions
• Extensions of Bayesian optimization for AutoML
– Bayesian optimization with random forests
– Applications
23
Motivation: AutoML
• Machine Learning has celebrated substantial successes
• But it requires human experts to
– Preprocess the data
– Perform feature selection
– Select a model family
– Optimize hyperparameters
– …
– Determine the effect of hyperparameters / choice of model
• AutoML: taking the human expert out of the loop
“Civilization advances by extending the number of important
operations which we can perform without thinking of them”
(Alfred North Whitehead)
24
Motivation: AutoML
• First workshop on AutoML at the
International Conference on Machine Learning (ICML)
– Beijing, China, two weeks ago
– Some topics:
• Hyperparameter optimization
• Combined search over model types & their hyperparameters
• Combined architecture and hyperparameter optimization
• Prediction of learning curves of deep neural networks
• Automated learning of ensembles
• Learning across data sets
• AutoML challenge, organized by Isabelle Guyon
– In 5 phases, September – April
– Room for MSc projects/theses
25
Types of Hyperparameters
• Numerical hyperparameters
– Continuous
– Integer
• Categorical hyperparameters
– Finite domain, unordered
– Special case: Boolean
• Examples in neural networks
– Continuous: learning rate, momentum, regularization, etc.
– Integer: #neurons, #layers, # of SGD steps
– Categorical: activation function ∈ {tanh, ReLU, sigmoid}
– Boolean: use preprocessing or not
26
Conditional Hyperparameters
• Conditional hyperparameters are only active if certain
other hyperparameters take certain values
– The hyperparameters they depend on are called their parents
• Examples in neural networks
– All hyperparameters in layer K are only active if the network
has at least K layers
– E.g., “Learning rate in layer 3” is a conditional parameter with
parent “network depth”
• Examples in model selection
– Select between algorithms A and B and their hyperparameters
– Then A’s hyperparameters are only active if we select A
27
Details on Conditional Hyperparameters
• Conditional hyperparameters can be parents themselves
– The hyperparameter space is often tree-structured
– Sometimes, hyperparameters depend on multiple parents; the
space is then structured as a directed acyclic graph (DAG)
• Semantics:
– If a hyperparameter is not active it does not matter which
value we select for it
– The learning algorithm will not even inspect the
hyperparameter
28
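To make the activity semantics concrete, here is a small MATLAB sketch (not from the slides; the structure and field names are invented for illustration) of a tree-structured space where per-layer hyperparameters are only active if the network is deep enough.

% Sketch: conditional hyperparameters in a tree-structured space (illustrative only).
config.num_layers = 2;                      % parent hyperparameter
config.lr_layer1  = 0.10;
config.lr_layer2  = 0.05;
config.lr_layer3  = 0.01;                   % inactive: the network has only 2 layers

is_active = @(cfg, K) (K <= cfg.num_layers);   % "lr_layerK" is active iff K <= num_layers

for K = 1:3
    if is_active(config, K)
        fprintf('lr_layer%d = %.2f (active)\n', K, config.(sprintf('lr_layer%d', K)));
    else
        fprintf('lr_layer%d is inactive: its value does not matter\n', K);
    end
end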
Bayesian Optimization for AutoML
• Problems for standard Gaussian Process (GP) approach:
– Complex parameter space
• High-dimensional (low effective dimensionality)
• Categorical parameters: finite domain, unordered, e.g.,
activation function ∈ {tanh, ReLU, sigmoid}
• Structure: e.g., “Learning rate in layer 3” is a
conditional parameter with parent “network depth”
– Noise: sometimes heteroscedastic, large, non-Gaussian
– Robustness (usability out of the box)
– Model overhead (budget is runtime, not #function evaluations)
• One solution: random forests [Breiman, '01]
– Adapted to yield uncertainty estimates
29
Random Forests (RFs)
• RFs for classification = bagging + decision trees
• You learned about decision trees before
• You also heard about bagging before
– Take T bootstrap samples of the data (sample with replacement)
– For each bootstrap sample, fit a machine learning model
– Average the T predictions
– Effective method to reduce over-fitting
• Bagging performs better with more uncorrelated models
– Trees are a great model for bagging:
individual trees overfit to different characteristics of the
bootstrap samples, and are thus only weakly correlated
30
Random Forests (RFs)
• Bagging is better the more uncorrelated its models are
• Idea: make trees less correlated using randomness
beyond the bootstrap sampling
– E.g., perturb data slightly in each tree
– E.g., in each split only allow splits on a random subset of
variables (only evaluate splitting criterion on those)
– E.g., in each split only allow a
random subset of split points for each split variable
31
Random Forests (RFs)
• Invented by Leo Breiman in 2001
• Combine bagging & decision trees
– While trees substantially overfit,
RFs don't typically overfit (much)
– RFs typically outperform pruned decision trees
– RFs are the most robust off-the-shelf supervised learning
method (together with boosted trees)
– Many ML competitions have been won with straight-up RFs
– Came from statistics literature, initially not very popular in ML
(not kernel-based, not neural networks)
[Photo: Leo Breiman, American statistician, 1928–2005. Image source: Wikipedia]
32
Random Forests (RFs)
• One of the most frequently cited papers in ML
33
Regression Trees
• You have learned about decision trees for classification
– Prediction: majority class label in selected leaf
– Splitting criterion: information gain
• Regression trees: the analogue for regression
– Prediction: mean numerical value in selected leaf
– Splitting criterion: minimize squared differences to
means in resulting left child R1 and right child R2
(resulting from choosing a split variable j and split point s; a small search sketch follows after this slide)
34
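As referenced above, a small MATLAB sketch (not from the slides; the data are invented) of the exhaustive search for the best split (j, s) of one regression-tree node, minimizing the sum of squared differences to the child means:

% Sketch: best split of one regression-tree node by exhaustive search.
X = rand(50, 3);                              % 50 points, 3 candidate split variables (toy data)
y = 3*X(:,2) + 0.1*randn(50,1);

best = struct('sse', inf, 'j', NaN, 's', NaN);
for j = 1:size(X,2)                           % candidate split variable
    for s = unique(X(:,j))'                   % candidate split point
        left = X(:,j) <= s;  right = ~left;
        if ~any(left) || ~any(right), continue; end
        sse = sum((y(left)  - mean(y(left))).^2) + ...
              sum((y(right) - mean(y(right))).^2);
        if sse < best.sse, best = struct('sse', sse, 'j', j, 's', s); end
    end
end
fprintf('best split: variable %d at %.3f (SSE %.3f)\n', best.j, best.s, best.sse);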
Regression Tree Training
– In each internal node: only store split criterion used
– In each leaf: store mean of runtimes
[Example tree: the root splits on param3 ∈ {blue, green} vs. param3 ∈ {red}; a child node
splits on feature2 ≤ 3.5 vs. feature2 > 3.5, with leaf means 3.7 and 1.65; …]
35
Regression Tree Predictions
E.g. xn+1 = (true, 4.7, red)
– Walk down the tree, return the mean runtime stored in the leaf → 1.65
[Same example tree as on the previous slide; the query follows the param3 ∈ {red}
and feature2 > 3.5 branches to the leaf with mean 1.65.]
36
Random Forests: Sets of Regression Trees
[Figure: a forest of T regression trees.]
• Training
– Draw T bootstrap samples of the data
– For each bootstrap sample, fit a randomized regression tree
• Prediction
– Predict with each of the T trees
– Return empirical mean and variance across these T predictions
• Complexity for N data points and T trees
– Training: O(T · N log² N)
– Prediction: O(T · log N)
• One of the best off-the-shelf learning methods available
37
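The RF model used for Bayesian optimization needs predictive uncertainty; the recipe above is simply the empirical mean and variance across the T per-tree predictions. A sketch (not from the slides), using fitrtree from MATLAB's Statistics and Machine Learning Toolbox as the base learner and omitting the per-split variable randomization for brevity:

% Sketch: random-forest prediction as empirical mean/variance across T trees.
T = 10;
X = rand(100, 3); y = sin(4*X(:,1)) + 0.1*randn(100,1);   % toy data (assumed)
Xquery = rand(5, 3);                                      % points to predict at

preds = zeros(size(Xquery,1), T);
for t = 1:T
    idx        = randi(size(X,1), size(X,1), 1);          % bootstrap sample (with replacement)
    tree       = fitrtree(X(idx,:), y(idx));              % fit a regression tree to the sample
    preds(:,t) = predict(tree, Xquery);
end
rf_mean = mean(preds, 2);      % predictive mean
rf_var  = var(preds, 0, 2);    % predictive variance, used as the uncertainty estimate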
SMAC in a Nutshell
[Hutter, Hoos, Leyton-Brown, 2009-2014]
SMAC: Sequential Model-Based Algorithm Configuration
repeat
construct RF model to predict performance
use that model to select promising configurations
compare each selected configuration against the best known
until time budget exhausted
• Distributed SMAC
– Maintain queue of promising configurations
– Compare these to * on distributed worker cores
38
Outline of Today’s Class
• Bayesian optimization
– Bayesian models:
Bayesian linear regression & Gaussian processes
– Acquisition functions
• Extensions of Bayesian optimization for AutoML
– Bayesian optimization with random forests
– Applications
39
Application to Auto-WEKA: off-the-shelf ML
[Thornton, Hutter, Hoos & Leyton-Brown, KDD'13]
WEKA [Witten et al, 1999-current]
– most widely used off-the-shelf machine learning package
– over 20,000 citations on Google scholar
Java implementation of a broad range of methods
– 27 base classifiers (with up to 10 parameters each)
– 10 meta-methods
– 2 ensemble methods
Different methods work best on different data sets
– Want a true off-the-shelf solution
40
WEKA’s configuration space
Base classifiers
– 27 choices, each with subparameters
Hierarchical structure on top of base classifiers
– In total: 768 parameters, 10^47 configurations
– Optimize cross-validation performance over this space using SMAC
41
Auto-WEKA: Results
Auto-WEKA performs better than best base classifier
– Even when “best classifier” uses an oracle
– Especially on the 8 largest datasets
– In 6/21 cases more than 10% reductions in relative error
– Time requirements: 30h on 4 cores
Comparison to full grid search
– Union of grids over parameters of all 27 base classifiers
– Auto-WEKA is 100 times faster
– Auto-WEKA has better generalization performance in 15/21 cases
Auto-WEKA based on SMAC vs. TPE [Bergstra et al, 2011]
– SMAC yielded better CV performance in 19/21 cases
– SMAC yielded better generalization performance in 14/21 cases
– Differences usually small, in 3 cases substantial (SMAC better)
42
Auto-WEKA Discussion
• Off-the-shelf machine learning tools are now available
– Expert understanding of ML techniques
not required to use them
– Users still need to provide good features
• Auto-WEKA is available online: automl.org/autoweka
• Ongoing work
– Wrappers for several programming languages
– Reason across datasets to jump-start Auto-WEKA
– Reduce need for manual feature engineering
via representation learning (Auto-Deep)
43
Comparing Hyperparameter Optimizers
[Eggensperger, Feurer, Hutter, Bergstra, Snoek, Hoos & Leyton-Brown, 2013]
• Hyperparameter optimization library: automl.org/hpolib
– Benchmarks
• Artificial test functions (for quick debugging)
• Low-dimensional: logistic regression, online LDA, structured SVM
• Medium-dimensional: neural network, deep network
• High-dimensional: Auto-WEKA
– Optimizers
• SMAC [Hutter et al, '11], based on random forests
• Spearmint [Snoek et al, '12], based on Gaussian processes
• TPE [Bergstra et al, '11], based on density estimators (EDA algorithm)
– Results
• Spearmint performs best for low-dimensional continuous problems
• SMAC performs best for high-dimensional structured optimization,
e.g. combined architecture search and hyperparameter optimization
44
Summary
• We discussed
– GP-based Bayesian optimization
– Random forests
– Applications in AutoML
• Learning goals: you can now …
– Derive the principle behind Bayesian optimization
– Explain various acquisition functions
– Explain Bayesian linear regression and Gaussian processes
– Explain regression trees and random forests
– Discuss some limitations of GP-based Bayesian optimization
45
Derivation: posterior predictive distribution
46