Optimization of Machine Learning Hyperparameters
Dr. Frank Hutter
Head of Emmy Noether Research Group on Learning, Optimization, and Automated Algorithm Design
Computer Science Institute, University of Freiburg, Germany
July 2014

Recap: Bayesian optimization

Today's Learning Goals
• After today's lecture, you can …
  – Derive the principles behind Bayesian optimization
  – Derive Bayesian linear regression
  – Derive the posterior distribution of Gaussian processes
  – Explain various acquisition functions
  – Discuss some limitations of GP-based Bayesian optimization
  – Explain regression trees and random forests

Outline of Today's Class
• Bayesian optimization
  – Bayesian models: Bayesian linear regression & Gaussian processes
  – Acquisition functions
• Extensions of Bayesian optimization for AutoML
  – Bayesian optimization with random forests
  – Applications

Bayes rule in Bayesian optimization
• Denote the observed data as D
• Denote our prior over functions as p(f)
• Then the posterior over functions is
  p(f | D) ∝ p(D | f) · p(f)
  (posterior ∝ likelihood × prior)

Bayesian linear regression & Gaussian processes
• Acknowledgement: the following slides are taken from Philipp Hennig's tutorial on Gaussian processes at the Machine Learning Summer School 2013
• All of Philipp's slides are online: http://mlss.tuebingen.mpg.de/hennig_slides1.pdf
• Philipp's website also has video lectures and more slides: http://www.is.tuebingen.mpg.de/nc/employee/details/phennig.html

Carl Friedrich Gauss (1777–1855): Paying Tolls with a Bell
  f(x) = 1/(σ√(2π)) · exp(−(x − µ)² / (2σ²))

The Gaussian distribution: Multivariate Form
  N(x; µ, Σ) = (2π)^(−N/2) |Σ|^(−1/2) exp[ −½ (x − µ)⊺ Σ⁻¹ (x − µ) ]
• x, µ ∈ Rᴺ, Σ ∈ Rᴺˣᴺ
• Σ is positive semidefinite, i.e.
  – v⊺ Σ v ≥ 0 for all v ∈ Rᴺ
  – Hermitian, all eigenvalues ≥ 0

Why Gaussian? An experiment
• Nothing in the real world is Gaussian (except sums of i.i.d. variables)
• But nothing in the real world is linear either!
• Gaussians are for inference what linear maps are for algebra.

Closure Under Multiplication: multiple Gaussian factors form a Gaussian
  N(x; a, A) · N(x; b, B) = N(x; c, C) · N(a; b, A + B)
  with C := (A⁻¹ + B⁻¹)⁻¹ and c := C(A⁻¹ a + B⁻¹ b)

Closure under Linear Maps: linear maps of Gaussians are Gaussians
  p(z) = N(z; µ, Σ)  ⇒  p(Az) = N(Az; Aµ, AΣA⊺)
  (the slide illustrates this with A = [1, −0.5])

Closure under Marginalization: projections of Gaussians are Gaussian
• Projection with A = (1 0):
  ∫ N[ (x, y); (µx, µy), (Σxx, Σxy; Σyx, Σyy) ] dy = N(x; µx, Σxx)
• This is the sum rule: ∫ p(x, y) dy = ∫ p(y | x) p(x) dy = p(x)
• So every finite-dimensional Gaussian is a marginal of infinitely many more

Closure under Conditioning: cuts through Gaussians are Gaussians
  p(x | y) = p(x, y) / p(y) = N(x; µx + Σxy Σyy⁻¹ (y − µy), Σxx − Σxy Σyy⁻¹ Σyx)
• This is the product rule
• So Gaussians are closed under the rules of probability
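As a quick, purely illustrative check of the conditioning formula, here is a minimal sketch in the same MATLAB style as the code later in the lecture (the numbers are arbitrary, not from the slides):

  % Minimal sketch (arbitrary numbers, not from the slides): apply the
  % conditioning formula to a 2-D Gaussian, conditioning x on an observed y.
  mu    = [1; 2];                       % [mu_x; mu_y]
  Sigma = [2.0 0.8; 0.8 1.0];           % [Sxx Sxy; Syx Syy]
  y_obs = 3;
  mu_cond  = mu(1) + Sigma(1,2) / Sigma(2,2) * (y_obs - mu(2));  % mu_x + Sxy Syy^-1 (y - mu_y)
  var_cond = Sigma(1,1) - Sigma(1,2) / Sigma(2,2) * Sigma(2,1);  % Sxx - Sxy Syy^-1 Syx
  fprintf('p(x | y = %.1f) = N(%.3f, %.3f)\n', y_obs, mu_cond, var_cond);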
What can we do with this? Linear regression: given y ∈ Rᴺ and the likelihood p(y | f), what is f?

A prior over linear functions
  f(x) = w₁ + w₂x = φx⊺ w, with features φx = (1, x)⊺
  p(w) = N(w; µ, Σ)
  p(f) = N(f; φx⊺ µ, φx⊺ Σ φx)

The posterior over linear functions
  Likelihood: p(y | w, φX) = N(y; φX⊺ w, σ² I)
  Posterior over the weights:
    p(w | y, φX) = N(w; µ + Σ φX (φX⊺ Σ φX + σ² I)⁻¹ (y − φX⊺ µ),
                        Σ − Σ φX (φX⊺ Σ φX + σ² I)⁻¹ φX⊺ Σ)
  Posterior over function values fx at test points x:
    p(fx | y, φX) = N(fx; φx⊺ µ + φx⊺ Σ φX (φX⊺ Σ φX + σ² I)⁻¹ (y − φX⊺ µ),
                          φx⊺ Σ φx − φx⊺ Σ φX (φX⊺ Σ φX + σ² I)⁻¹ φX⊺ Σ φx)

In code (the script from the slides, with the original layout restored):

  % prior on w
  F     = 2;                                 % number of features
  phi   = @(a)(bsxfun(@power,a,0:F-1));      % phi(a) = [1, a]
  mu    = zeros(F,1);
  Sigma = eye(F);                            % p(w) = N(mu, Sigma)

  % prior on f(x)
  n     = 100; x = linspace(-6,6,n)';        % 'test' points
  phix  = phi(x);                            % features of x
  m     = phix * mu;
  kxx   = phix * Sigma * phix';              % p(fx) = N(m, kxx)
  s     = bsxfun(@plus,m,chol(kxx + 1.0e-8 * eye(n))' * randn(n,3)); % samples from prior
  stdpi = sqrt(diag(kxx));                   % marginal stddev, for plotting

  load('data.mat'); N = length(Y);           % gives Y, X, sigma

  % prior on Y = fX + noise
  phiX  = phi(X);                            % features of data
  M     = phiX * mu;
  kXX   = phiX * Sigma * phiX';              % p(fX) = N(M, kXX)
  G     = kXX + sigma^2 * eye(N);            % p(Y) = N(M, kXX + sigma^2 I)
  R     = chol(G);                           % most expensive step: O(N^3)
  kxX   = phix * Sigma * phiX';              % cov(fx, fX) = kxX
  A     = kxX / R;                           % pre-compute for re-use

  mpost = m + A * (R' \ (Y-M));              % posterior mean: m + kxX (kXX + sigma^2 I)^-1 (Y - M)
  vpost = kxx - A * A';                      % posterior cov:  kxx - kxX (kXX + sigma^2 I)^-1 kXx
  spost = bsxfun(@plus,mpost,chol(vpost + 1.0e-8 * eye(n))' * randn(n,3)); % posterior samples
  stdpo = sqrt(diag(vpost));                 % marginal stddev, for plotting
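The script returns the posterior over function values. For completeness, a minimal sketch (not on the slides) of the corresponding posterior over the weights themselves, implementing the formula for p(w | y, φX) above with the same variables phi, mu, Sigma, X, Y, sigma:

  % Minimal sketch (not from the slides): weight-space posterior
  % p(w | y, phiX) = N(mu_w, Sigma_w), same model and variables as above.
  phiX    = phi(X);
  Gw      = phiX * Sigma * phiX' + sigma^2 * eye(length(Y));
  K       = Sigma * phiX' / Gw;              % gain: Sigma phiX' (phiX Sigma phiX' + sigma^2 I)^-1
  mu_w    = mu + K * (Y - phiX * mu);        % posterior mean of w
  Sigma_w = Sigma - K * phiX * Sigma;        % posterior covariance of w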
A More Realistic Dataset: General Linear Regression
• On a more interesting dataset, which features should we use? In general, f(x) = φx⊺ w for some feature map φ.
• For the linear model, f(x) = w₁ + w₂x = φx⊺ w with φx := (1, x)⊺.
• The script above stays exactly the same for any choice of φ; only the definition of phi changes:

Cubic Regression
  phi = @(a)(bsxfun(@power,a,[0:3]));
  f(x) = φ(x)⊺ w, φ(x) = (1, x, x², x³)⊺

Septic Regression
  phi = @(a)(bsxfun(@power,a,[0:7]));
  f(x) = φ(x)⊺ w, φ(x) = (1, x, x², …, x⁷)⊺

Fourier Regression
  phi = @(a)(2 * [cos(bsxfun(@times,a/8,[0:8])), sin(bsxfun(@times,a/8,[1:8]))]);
  φ(x) = (cos(x), cos(2x), cos(3x), …, sin(x), sin(2x), …)⊺ (here with frequencies scaled by 1/8)

Step Regression
  phi = @(a)(-1 + 2 * bsxfun(@lt,a,linspace(-8,8,16)));
  Each feature is a step function −1 + 2·θ(c − x) for 16 locations c between −8 and 8

V Regression
  phi = @(a)(bsxfun(@minus,abs(bsxfun(@minus,a,linspace(-8,8,16))),linspace(-8,8,16)));
  Each feature is a V-shaped function |x − c| − c for 16 locations c between −8 and 8

Eiffel Tower Regression
  phi = @(a)(exp(-abs(bsxfun(@minus,a,[-8:1:8]))));
  φ(x) = (e^(−|x−c|) for c = −8, −7, …, 8)⊺

Bell Curve Regression
  phi = @(a)(exp(-0.5 * bsxfun(@minus,a,[-8:1:8]).^2));
  φ(x) = (e^(−(x−c)²/2) for c = −8, −7, …, 8)⊺

Multiple Inputs
• All of this works in multiple dimensions, too: features φ : Rᴺ → R and functions f : Rᴺ → R
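As one concrete illustration of the multi-input case, a minimal sketch of a bell-curve feature map for 2-D inputs that can be plugged into the same regression script (the grid of centres is an arbitrary choice, not from the slides):

  % Minimal sketch (centres chosen arbitrarily): bell-curve features for
  % 2-D inputs on a 5x5 grid of centres. Rows of the input A are points in R^2.
  [c1, c2] = meshgrid(linspace(-8,8,5), linspace(-8,8,5));
  C   = [c1(:), c2(:)];                                    % 25 centres in R^2
  phi = @(A)(exp(-0.5 * (bsxfun(@minus, A(:,1), C(:,1)').^2 + ...
                         bsxfun(@minus, A(:,2), C(:,2)').^2)));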
How many features should we use? Let's look at that algebra again:
  p(fx | y, φX) = N(fx; φx⊺ µ + φx⊺ Σ φX (φX⊺ Σ φX + σ² I)⁻¹ (y − φX⊺ µ),
                        φx⊺ Σ φx − φx⊺ Σ φX (φX⊺ Σ φX + σ² I)⁻¹ φX⊺ Σ φx)
• There is no lonely φ in there; all objects involving φ are of the form
  – φ⊺ µ: the mean function
  – φ⊺ Σ φ: the kernel
• Once these are known, the cost is independent of the number of features
• Remember the code:

  M   = phiX * mu;
  m   = phix * mu;
  kXX = phiX * Sigma * phiX';   % p(fX) = N(M, kXX)
  kxx = phix * Sigma * phix';   % p(fx) = N(m, kxx)
  kxX = phix * Sigma * phiX';   % cov(fx, fX) = kxX

• Replacing the explicit features by a mean function mu(·) and a kernel k(·,·) turns the same script into its kernelized version:

  % prior
  F   = 2;                                  % number of features
  phi = @(a)(bsxfun(@power,a,0:F));         % polynomial features a.^0 ... a.^F
  k   = @(a,b)(phi(a) * phi(b)');           % kernel; k(X,X) must be N x N
  mu  = @(a)(zeros(size(a,1),1));           % mean function

  % belief on f(x)
  n     = 100; x = linspace(-6,6,n)';       % 'test' points
  m     = mu(x); kxx = k(x,x);              % p(fx) = N(m, kxx)
  s     = bsxfun(@plus,m,chol(kxx + 1.0e-8 * eye(n))' * randn(n,3)); % samples from prior
  stdpi = sqrt(diag(kxx));                  % marginal stddev, for plotting

  load('data.mat'); N = length(Y);          % gives Y, X, sigma

  % prior on Y = fX + noise
  M   = mu(X); kXX = k(X,X);                % p(fX) = N(M, kXX)
  G   = kXX + sigma^2 * eye(N);             % p(Y) = N(M, kXX + sigma^2 I)
  R   = chol(G);                            % most expensive step: O(N^3)
  kxX = k(x,X);                             % cov(fx, fX) = kxX
  A   = kxX / R;                            % pre-compute for re-use

  mpost = m + A * (R' \ (Y-M));             % posterior mean: m + kxX (kXX + sigma^2 I)^-1 (Y - M)
  vpost = kxx - A * A';                     % posterior cov:  kxx - kxX (kXX + sigma^2 I)^-1 kXx
  spost = bsxfun(@plus,mpost,chol(vpost + 1.0e-8 * eye(n))' * randn(n,3)); % posterior samples
  stdpo = sqrt(diag(vpost));                % marginal stddev, for plotting

Exponentiated Squares
  phi = @(a)(exp(-0.5 * bsxfun(@minus,a,linspace(-8,8,10)).^2 ./ell.^2));   % 10 bell-curve features
  phi = @(a)(exp(-0.5 * bsxfun(@minus,a,linspace(-8,8,30)).^2 ./ell.^2));   % 30 bell-curve features
  k   = @(a,b)(5*exp(-0.25*bsxfun(@minus,a,b').^2));                        % in the limit: work with the kernel directly
• a.k.a. radial basis function, square(d)-exponential kernel

What just happened? Gaussian process priors
Definition: A function k : X × X → R is a Mercer kernel if, for any finite collection X = [x₁, …, xN], the matrix kXX ∈ Rᴺˣᴺ with elements kXX,(i,j) = k(xᵢ, xⱼ) is positive semidefinite.
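A quick empirical illustration of that definition (a minimal sketch, not from the slides): build the kernel matrix of the exponentiated-squares kernel above on random inputs and check that its eigenvalues are non-negative up to round-off.

  % Minimal sketch (not from the slides): empirically check positive
  % semidefiniteness of the exponentiated-squares kernel.
  k = @(a,b)(5*exp(-0.25*bsxfun(@minus,a,b').^2));
  X = randn(50,1);
  K = k(X,X);
  min(eig((K + K')/2))     % >= 0 up to numerical round-off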
Definition: Let µ : X → R be any function and k : X × X → R be a Mercer kernel. A Gaussian process p(f) = GP(f; µ, k) is a probability distribution over the function f : X → R such that every finite restriction to function values fX := [fx₁, …, fxN] is a Gaussian distribution p(fX) = N(fX; µX, kXX).

Samples from a zero-mean Gaussian process (source: GPML, Figure 2.2)
• Samples from the prior are zero-mean, with values drawn from a multivariate Gaussian distribution
• The kernel/covariance function k tells us how correlated the function values at two points are

Kernels
• The dot-product kernel: k(x, x′) = x⊺ x′
• The squared exponential kernel: k(x, x′) = exp(−‖x − x′‖² / (2ℓ²))
• There are many kernels!
  – See, e.g., "Learning with Kernels" by Bernhard Schölkopf, 1998
  – Kernels over sentences, images, grammars, graphs, etc.

Kernels have hyperparameters
• E.g., the length scale hyperparameter ℓ in the squared exponential kernel

Samples from the prior and the posterior (exercise: derive the posterior)
• The joint distribution of the observed function values f1:N and a new function value f* is multivariate Gaussian
• Then, the posterior is also a Gaussian

The predictive posterior distribution
• Writing K = k(X, X) and k* = [k(x*, x₁), …, k(x*, xN)]⊺, the posterior Gaussian process (zero prior mean) has predictive distribution
  p(f* | x*, X, f1:N) = N(µ*, σ*²), where µ* = k*⊺ K⁻¹ f1:N and σ*² = k(x*, x*) − k*⊺ K⁻¹ k*
  (derivation: exercise)

The predictive posterior under noise
• With noisy observations y = f(x) + ε, ε ~ N(0, σn²), the predictive distribution becomes
  p(f* | x*, X, y) = N(µ*, σ*²), where µ* = k*⊺ (K + σn² I)⁻¹ y and σ*² = k(x*, x*) − k*⊺ (K + σn² I)⁻¹ k*
  (derivation: exercise)

Computational complexity of GPs
• Let t denote the number of data points in the GP
• Fitting: inverting the kernel matrix costs time O(t³)
  – In practice, we also need to optimize/sample the kernel hyperparameters (such as the kernel's length scale)
  – Evaluating k different hyperparameter settings: O(k·t³)
• Predictions of the variance: O(t²)
• Predictions of the mean: O(t)

Outline of Today's Class
• Bayesian optimization
  – Bayesian models: Bayesian linear regression & Gaussian processes
  – Acquisition functions
• Extensions of Bayesian optimization for AutoML
  – Bayesian optimization with random forests
  – Applications

The acquisition function
• Given a posterior model p(f | D), which point should we select next to find the maximum of f?
• We want to trade off
  – Exploitation (sampling where we expect the function to be high) vs.
  – Exploration (sampling where we are uncertain about the function value)
• Various acquisition functions achieve this tradeoff

Probability of Improvement
• With posterior mean µ(x), posterior standard deviation σ(x) and best observed value f⁺:
  PI(x) = P(f(x) ≥ f⁺) = Φ((µ(x) − f⁺) / σ(x)), where Φ is the standard normal CDF

Expected Improvement
  EI(x) = E[max(f(x) − f⁺, 0)] = (µ(x) − f⁺) Φ(z) + σ(x) ϕ(z), with z = (µ(x) − f⁺) / σ(x)
  (derivation of the closed-form solution: exercise)

Upper Confidence Bound (UCB)
  UCB(x) = µ(x) + β σ(x), for an exploration parameter β > 0
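Since the closed forms are left as exercises, here is a minimal MATLAB sketch (not from the slides) evaluating all three acquisition functions from the posterior mean mpost and standard deviation stdpo computed by the GP script above, for maximization; the value of beta is an arbitrary choice:

  % Minimal sketch (not from the slides): PI, EI and UCB on the test grid x,
  % using mpost/stdpo from the kernelized script above.
  fbest = max(Y); beta = 2;
  z     = (mpost - fbest) ./ stdpo;
  Phi   = 0.5 * (1 + erf(z / sqrt(2)));             % standard normal CDF
  phiN  = exp(-0.5 * z.^2) / sqrt(2*pi);            % standard normal PDF
  PI    = Phi;                                      % probability of improvement
  EI    = (mpost - fbest) .* Phi + stdpo .* phiN;   % expected improvement
  UCB   = mpost + beta * stdpo;                     % upper confidence bound
  [~, idx] = max(EI); x_next = x(idx);              % next point to evaluate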
Entropy Search [Hennig & Schuler, JMLR 2012]
• Compute a probability distribution over which input x is optimal
• Acquisition function: try to push this probability distribution as close to a delta distribution as possible
• One of the most powerful acquisition functions
  – Can choose to actively evaluate in one region of the space to learn something about a different region of the space
  – Example application in hyperparameter optimization: evaluate a hyperparameter setting on a subset of the data points (cheap) to learn about the best setting for all the data

Putting it all Together
• How to optimize the acquisition function?
  – Subsidiary optimization method
  – Important: in that subsidiary optimization, function evaluations are cheap (just predictions of the GP)
• How to update the GP?
  – Subsidiary optimization method to set the kernel hyperparameters
  – Function evaluations are also cheap (just fitting the GP)

Summary of Bayesian Optimization
• Bayesian optimization integrates
  – prior information and
  – the likelihood of the observed data
• It uses quite involved computation to select which function value to evaluate next
  – Thus, it is most useful for expensive blackbox functions

Outline of Today's Class
• Bayesian optimization
  – Bayesian models: Bayesian linear regression & Gaussian processes
  – Acquisition functions
• Extensions of Bayesian optimization for AutoML
  – Bayesian optimization with random forests
  – Applications

Motivation: AutoML
• Machine learning has celebrated substantial successes
• But it requires human experts to
  – Preprocess the data
  – Perform feature selection
  – Select a model family
  – Optimize hyperparameters
  – …
  – Determine the effect of hyperparameters / choice of model
• AutoML: taking the human expert out of the loop
  "Civilization advances by extending the number of important operations which we can perform without thinking of them" (Alfred North Whitehead)

Motivation: AutoML
• First workshop on AutoML at the International Conference on Machine Learning (ICML)
  – Beijing, China, two weeks ago
  – Some topics:
    • Hyperparameter optimization
    • Combined search over model types & their hyperparameters
    • Combined architecture and hyperparameter optimization
    • Prediction of learning curves of deep neural networks
    • Automated learning of ensembles
    • Learning across data sets
• AutoML challenge, organized by Isabelle Guyon
  – In 5 phases, September – April
  – Room for MSc projects/theses

Types of Hyperparameters
• Numerical hyperparameters
  – Continuous
  – Integer
• Categorical hyperparameters
  – Finite domain, unordered
  – Special case: Boolean
• Examples in neural networks
  – Continuous: learning rate, momentum, regularization, etc.
  – Integer: #neurons, #layers, # of SGD steps
  – Categorical: activation function {tanh, ReLU, or sigmoid}
  – Boolean: use preprocessing or not

Conditional Hyperparameters
• Conditional hyperparameters are only active if certain other hyperparameters take certain values
  – The hyperparameters they depend on are called their parents
• Examples in neural networks
  – All hyperparameters in layer K are only active if the network has at least K layers
  – E.g., "learning rate in layer 3" is a conditional hyperparameter with parent "network depth"
• Examples in model selection
  – Select between algorithms A and B and their hyperparameters
  – Then A's hyperparameters are only active if we select A

Details on Conditional Hyperparameters
• Conditional hyperparameters can be parents themselves
  – The hyperparameter space is often tree-structured
  – Sometimes, hyperparameters depend on multiple parents; the space is then structured as a directed acyclic graph (DAG)
• Semantics:
  – If a hyperparameter is not active, it does not matter which value we select for it
  – The learning algorithm will not even inspect that hyperparameter
  – (One possible encoding of such a space is sketched below)
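The following is a hypothetical sketch of how such a conditional space could be written down; the field names and the active_if helper are illustrative assumptions, not the format used by SMAC or Auto-WEKA.

  % Hypothetical sketch (field names are assumptions, not the SMAC format):
  % a tiny conditional space for a neural network, where the layer-3
  % learning rate is only active if the depth is at least 3.
  cs.depth      = struct('type','integer',     'range', [1 5]);
  cs.activation = struct('type','categorical', 'values', {{'tanh','ReLU','sigmoid'}});
  cs.lr_layer3  = struct('type','continuous',  'range', [1e-5 1e-1], ...
                         'parent','depth',     'active_if', @(d)(d >= 3));
  % A configuration only carries values for its active hyperparameters:
  config = struct('depth', 2, 'activation', 'ReLU');   % lr_layer3 is inactive here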
Bayesian Optimization for AutoML
• Problems for the standard Gaussian process (GP) approach:
  – Complex parameter space
    • High-dimensional (low effective dimensionality)
    • Categorical parameters: finite domain, unordered, e.g., activation function {tanh, ReLU, or sigmoid}
    • Structure: e.g., "learning rate in layer 3" is a conditional parameter with parent "network depth"
  – Noise: sometimes heteroscedastic, large, non-Gaussian
  – Robustness (usability out of the box)
  – Model overhead (budget is runtime, not #function evaluations)
• One solution: random forests [Breiman, '01]
  – Adapted to yield uncertainty estimates

Random Forests (RFs)
• RFs for classification = bagging + decision trees
• You learned about decision trees before
• You also heard about bagging before
  – Take T bootstrap samples of the data (sample with repetitions)
  – For each bootstrap sample, fit a machine learning model
  – Average the T predictions
  – Effective method to reduce over-fitting
• Bagging performs better with more uncorrelated models
  – Trees are a great model for bagging: individual trees overfit to different characteristics of the bootstrap samples, and are thus only weakly correlated

Random Forests (RFs)
• Bagging is better the more uncorrelated its models are
• Idea: make trees less correlated by using randomness beyond the bootstrap sampling
  – E.g., perturb the data slightly in each tree
  – E.g., in each split, only allow splits on a random subset of the variables (only evaluate the splitting criterion on those)
  – E.g., in each split, only allow a random subset of split points for each split variable

Random Forests (RFs)
• Invented by Leo Breiman (American statistician, 1928–2005) in 2001
• Combine bagging & decision trees
  – While individual trees substantially overfit, RFs typically don't overfit (much)
  – RFs typically outperform pruned decision trees
  – RFs are the most robust off-the-shelf supervised learning method (together with boosted trees)
  – Many ML competitions have been won with straight-up RFs
  – Came from the statistics literature; initially not very popular in ML (not kernel-based, not neural networks)
• The random forest paper is one of the most frequently cited papers in ML

Regression Trees
• You have learned about decision trees for classification
  – Prediction: majority class label in the selected leaf
  – Splitting criterion: information gain
• Regression trees: the analogue for regression
  – Prediction: mean numerical value in the selected leaf
  – Splitting criterion: minimize the squared differences to the means in the resulting left child R1 and right child R2 (resulting from choosing a split variable j and a split point s); a minimal sketch of this criterion follows below
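To make the splitting criterion concrete, here is a minimal sketch on toy data with a single split variable (not the SMAC implementation): choose the split point s that minimizes the summed squared deviations from the two child means.

  % Minimal sketch (toy data, not the SMAC implementation): exhaustive
  % search for the best split point on one variable, minimizing the sum
  % of squared deviations from the means of the two children.
  x = randn(50,1); y = sin(x) + 0.1*randn(50,1);   % toy 1-D regression data
  best = Inf; split = NaN;
  for s = unique(x)'                               % candidate split points
      L = y(x <= s); R = y(x > s);
      if isempty(L) || isempty(R), continue; end
      cost = sum((L - mean(L)).^2) + sum((R - mean(R)).^2);
      if cost < best, best = cost; split = s; end
  end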
Regression Tree Training
• In each internal node: only store the split criterion used
• In each leaf: store the mean of the response values (here: runtimes)
• Example tree from the slides: the root splits on param3 ∈ {blue, green} vs. param3 ∈ {red}, a child node splits on feature2 ≤ 3.5 vs. feature2 > 3.5, and the leaves store means such as 3.7 and 1.65

Regression Tree Predictions
• E.g., for xn+1 = (true, 4.7, red):
  – Walk down the tree and return the mean runtime stored in the selected leaf: here 1.65

Random Forests: Sets of Regression Trees
• Training
  – Draw T bootstrap samples of the data
  – For each bootstrap sample, fit a randomized regression tree
• Prediction
  – Predict with each of the T trees
  – Return the empirical mean and variance across these T predictions (a minimal sketch follows below)
• Complexity for N data points and T trees
  – Training: O(T · N log² N)
  – Prediction: O(T · log N)
• One of the best off-the-shelf learning methods available
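The uncertainty estimate that this random-forest variant provides is simply the spread of the individual tree predictions. A schematic sketch (forest, tree_predict and x_query are placeholders, not the SMAC code):

  % Schematic sketch (placeholders, not the SMAC code): predictive mean and
  % variance of a random forest at a query point x_query.
  preds = zeros(T, 1);
  for t = 1:T
      preds(t) = tree_predict(forest{t}, x_query);   % walk tree t down to a leaf
  end
  mu_hat  = mean(preds);                             % predictive mean
  var_hat = var(preds);                              % predictive variance (uncertainty estimate)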
SMAC in a Nutshell [Hutter, Hoos & Leyton-Brown, 2009–2014]
• SMAC: Sequential Model-Based Algorithm Configuration
  repeat
    construct a random-forest model to predict performance
    use that model to select promising configurations
    compare each selected configuration against the best known configuration
  until the time budget is exhausted
• Distributed SMAC
  – Maintain a queue of promising configurations
  – Compare these to the incumbent on distributed worker cores

Outline of Today's Class
• Bayesian optimization
  – Bayesian models: Bayesian linear regression & Gaussian processes
  – Acquisition functions
• Extensions of Bayesian optimization for AutoML
  – Bayesian optimization with random forests
  – Applications

Application to Auto-WEKA: off-the-shelf ML [Thornton, Hutter, Hoos & Leyton-Brown, KDD'13]
• WEKA [Witten et al., 1999–current]
  – The most widely used off-the-shelf machine learning package
  – Over 20,000 citations on Google Scholar
• Java implementation of a broad range of methods
  – 27 base classifiers (with up to 10 parameters each)
  – 10 meta-methods
  – 2 ensemble methods
• Different methods work best on different data sets
  – We want a true off-the-shelf solution

WEKA's configuration space
• Base classifiers: 27 choices, each with subparameters
• Hierarchical structure on top of the base classifiers
  – In total: 768 parameters, about 10^47 configurations
  – Optimize cross-validation performance over this space using SMAC

Auto-WEKA: Results
• Auto-WEKA performs better than the best base classifier
  – Even when the "best classifier" uses an oracle
  – Especially on the 8 largest datasets
  – In 6/21 cases, more than 10% reduction in relative error
  – Time requirements: 30 h on 4 cores
• Comparison to a full grid search
  – Union of grids over the parameters of all 27 base classifiers
  – Auto-WEKA is 100 times faster
  – Auto-WEKA has better generalization performance in 15/21 cases
• Auto-WEKA based on SMAC vs. TPE [Bergstra et al., 2011]
  – SMAC yielded better cross-validation performance in 19/21 cases
  – SMAC yielded better generalization performance in 14/21 cases
  – Differences were usually small; in 3 cases they were substantial (SMAC better)

Auto-WEKA Discussion
• Off-the-shelf machine learning tools are now available
  – Expert understanding of ML techniques is not required to use them
  – Users still need to provide good features
• Auto-WEKA is available online: automl.org/autoweka
• Ongoing work
  – Wrappers for several programming languages
  – Reason across datasets to jump-start Auto-WEKA
  – Reduce the need for manual feature engineering via representation learning (Auto-Deep)

Comparing Hyperparameter Optimizers [Eggensperger, Feurer, Hutter, Bergstra, Snoek, Hoos & Leyton-Brown, 2013]
• Hyperparameter optimization library: automl.org/hpolib
  – Benchmarks
    • Artificial test functions (for quick debugging)
    • Low-dimensional: logistic regression, online LDA, structured SVM
    • Medium-dimensional: neural network, deep network
    • High-dimensional: Auto-WEKA
  – Optimizers
    • SMAC [Hutter et al., '11], based on random forests
    • Spearmint [Snoek et al., '12], based on Gaussian processes
    • TPE [Bergstra et al., '11], based on density estimators (an EDA-style algorithm)
  – Results
    • Spearmint performs best for low-dimensional continuous problems
    • SMAC performs best for high-dimensional structured optimization, e.g., combined architecture search and hyperparameter optimization

Summary
• We discussed
  – GP-based Bayesian optimization
  – Random forests
  – Applications in AutoML
• Learning goals: you can now …
  – Derive the principle behind Bayesian optimization
  – Explain various acquisition functions
  – Explain Bayesian linear regression and Gaussian processes
  – Explain regression trees and random forests
  – Discuss some limitations of GP-based Bayesian optimization

Derivation: posterior predictive distribution
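A standard derivation of the predictive posterior, consistent with the Gaussian conditioning rule used earlier in the lecture, is sketched here for completeness (notation: K = k(X, X), k_* = [k(x_*, x_1), ..., k(x_*, x_N)]^T, zero prior mean, observations y = f(X) + ε with ε ~ N(0, σ_n² I)):

\begin{pmatrix} y \\ f_* \end{pmatrix}
\sim \mathcal{N}\!\left( \mathbf{0},\;
\begin{pmatrix} K + \sigma_n^2 I & k_* \\ k_*^\top & k(x_*, x_*) \end{pmatrix} \right)

Applying the conditioning rule p(x | y) = N(x; µx + Σxy Σyy⁻¹ (y − µy), Σxx − Σxy Σyy⁻¹ Σyx) with x = f_* gives

p(f_* \mid x_*, X, y)
= \mathcal{N}\!\left( f_*;\; k_*^\top (K + \sigma_n^2 I)^{-1} y,\;
k(x_*, x_*) - k_*^\top (K + \sigma_n^2 I)^{-1} k_* \right),

which is exactly the noisy predictive posterior stated above; setting σ_n² = 0 recovers the noise-free case.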