Hierarchical Bayesian Optimization Algorithm (hBOA)
Martin Pelikan
University of Missouri at St. Louis
[email protected]

Foreword

Motivation
• Black-box optimization (BBO) problem
    • Set of all potential solutions
    • Performance measure (evaluation procedure)
• Task: Find optimum (best solution)
• Formulation useful: No need for gradient, numerical functions, …
• But many important and tough challenges
This talk
• Combine machine learning and evolutionary computation
• Create practical and powerful optimizers (BOA and hBOA)

Overview
• Black-box optimization (BBO)
• BBO via probabilistic modeling
    • Motivation and examples
    • Bayesian optimization algorithm (BOA)
    • Hierarchical BOA (hBOA)
• Theory and experiment
• Conclusions

Black-box Optimization
Input
• What do potential solutions look like?
• How to evaluate quality of potential solutions?
Output
• Best solution (the optimum)
Important
• We don't know what's inside the evaluation procedure
• Vector and tree representations common
• This talk: Binary strings of fixed length

BBO: Examples
Atomic cluster optimization
• Solutions: Vectors specifying positions of all atoms
• Performance: Lower energy is better
Telecom network optimization
• Solutions: Connections between nodes (cities, …)
• Performance: Satisfy constraints, minimize cost
Design
• Solutions: Vectors specifying parameters of the design
• Performance: Finite element analysis, experiment, …

BBO: Advantages & Difficulties
Advantages
• Use same optimizer for all problems.
• No need for much prior knowledge.
Difficulties
• Many places to go
    • 100-bit strings: 1267650600228229401496703205376 solutions.
    • Enumeration is not an option.
• Many places to get stuck
    • Local operators are not an option.
• Must learn what's in the box automatically.
• Noise, multiple objectives, interactive evaluation, ...

Typical Black-Box Optimizer
• Sample solutions
• Evaluate sampled solutions
• Learn to sample better
[Figure: cycle of Sample, Evaluate, Learn]

Many Ways to Do It
Hill climber
• Start with a random solution.
• Flip bit that improves the solution most.
• Finish when no more improvement possible.
Simulated annealing
• Introduce Metropolis.
Evolutionary algorithms
• Inspiration from natural evolution and genetics.

Evolutionary Algorithms
• Evolve a population of candidate solutions.
• Start with a random population.
• Iteration
    • Selection: Select promising solutions
    • Variation: Apply crossover and mutation to selected solutions
    • Replacement: Incorporate new solutions into original population

Estimation of Distribution Algorithms
Replace standard variation operators by
• Building a probabilistic model of promising solutions
• Sampling the built model to generate new solutions
Probabilistic model
• Stores features that make good solutions good
• Generates new solutions with just those features

EDAs
[Figure: current population → selected population → probabilistic model → new population]

What Models to Use?
Our plan
• Simple example: Probability vector for binary strings
• Bayesian networks (BOA)
• Bayesian networks with local structures (hBOA)

Probability Vector
Baluja (1995)
• Assumes binary strings of fixed length
• Stores probability of a 1 in each position.
• New strings generated with those proportions.
• Example:
    • (0.5, 0.5, …, 0.5) for uniform distribution
    • (1, 1, …, 1) for generating strings of all 1s

EDA Example: Probability Vector
Current population:  11001, 10101, 01011, 11000
Selected population: 11001, 10101
Probability vector:  1.0 0.5 0.5 0.0 1.0
New population:      10101, 11101, 11001, 10001

Probability Vector Dynamics
• Bits that perform better get more copies.
• And are combined in new ways.
• But context of each bit is ignored.
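
The probability-vector EDA described above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function names, truncation selection, the parameter defaults, and the choice to resample the whole population each generation (instead of merging new strings into the old population) are all assumptions made for brevity.

```python
import random


def estimate_vector(selected):
    """Proportion of 1s in each position of the selected strings."""
    n = len(selected[0])
    return [sum(s[i] for s in selected) / len(selected) for i in range(n)]


def sample_string(p, rng):
    """Generate one string: position i is 1 with probability p[i]."""
    return [1 if rng.random() < p_i else 0 for p_i in p]


def probability_vector_eda(fitness, n, pop_size=100, n_select=50,
                           generations=50, seed=0):
    """Hypothetical helper: sample, select, re-estimate, repeat."""
    rng = random.Random(seed)
    p = [0.5] * n  # start from the uniform distribution
    best = None
    for _ in range(generations):
        population = [sample_string(p, rng) for _ in range(pop_size)]
        population.sort(key=fitness, reverse=True)
        selected = population[:n_select]  # truncation selection (assumption)
        if best is None or fitness(selected[0]) > fitness(best):
            best = selected[0]
        p = estimate_vector(selected)  # context of each bit is ignored
    return best, p


# Selected strings 11001 and 10101 give the vector from the example slide.
print(estimate_vector([[1, 1, 0, 0, 1], [1, 0, 1, 0, 1]]))

# Fitness here is simply the number of 1s in the string.
best, p = probability_vector_eda(sum, n=20)
print(sum(best), p)
```

Each position is estimated and sampled independently, which is exactly why the model cannot capture interactions between bits.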
Example problem 1: ONEMAX
$$f(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} X_i$$
Optimum: 111…1

Probability Vector on ONEMAX
[Figure: probability vector entries (proportions of 1s, 0 to 1) versus generation (0 to 50), with the optimum marked. Outcome: Success]

Probability Vector: Ideal Scale-up
• O(n log n) evaluations until convergence
    • (Harik, Cantú-Paz, Goldberg, & Miller, 1997)
    • (Mühlenbein, Schlierkamp-Voosen, 1993)
• Other algorithms
    • Hill climber: O(n log n) (Mühlenbein, 1992)
    • GA with uniform: approx. O(n log n)
    • GA with one-point: slightly slower

When Does Prob. Vector Fail?
Example problem 2: Concatenated traps
• Partition input string into disjoint groups of 5 bits.
• Each group contributes via trap (ones = number of ones):
$$\mathrm{trap}(\mathit{ones}) = \begin{cases} 5 & \text{if } \mathit{ones} = 5 \\ 4 - \mathit{ones} & \text{otherwise} \end{cases}$$
• Concatenated trap = sum of single traps
• Optimum: 111…1

Trap-5
[Figure: trap(u) versus number of 1s u (0 to 5), with the global optimum at u = 5]

Probability Vector on Traps
[Figure: probability vector entries (proportions of 1s, 0 to 1) versus generation (0 to 50), with the optimum marked. Outcome: Failure]

Why Failure?
• Onemax: optimum in 111…1
    • 1 outperforms 0 on average.
• Traps: optimum in 11111, but
    • f(0****) = 2
    • f(1****) = 1.375
• So single bits are misleading.

How to Fix It?
• Consider 5-bit statistics instead of 1-bit ones.
• Then, 11111 would outperform 00000.
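
The two schema averages quoted above can be checked by enumerating all 16 completions of the remaining four bits. The sketch below assumes the trap definition given earlier (5 for five ones, 4 minus the number of ones otherwise); the function names are illustrative only.

```python
from itertools import product


def trap5(ones):
    """5-bit trap: 5 if all five bits are 1, otherwise 4 - ones."""
    return 5 if ones == 5 else 4 - ones


def schema_average(first_bit):
    """Average trap value over all strings whose first bit is fixed."""
    values = [trap5(first_bit + sum(rest))
              for rest in product((0, 1), repeat=4)]
    return sum(values) / len(values)


print(schema_average(0))  # 2.0
print(schema_average(1))  # 1.375
print(trap5(5), trap5(0))  # 5 4
```

A single bit set to 0 looks better on average than a 1, even though the full block 11111 scores 5 against 4 for 00000.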
• Learn model
    • Compute p(00000), p(00001), …, p(11111)
• Sample model
    • Sample 5 bits at a time
    • Generate 00000 with p(00000), 00001 with p(00001), …

Correct Model on Traps: Dynamics
[Figure: probabilities of 11111 (0 to 1) versus generation (0 to 50), with the optimum marked. Outcome: Success]

Good News: Good Stats Work Great!
• Optimum in O(n log n) evaluations.
• Same performance as on onemax!
• Others
    • Hill climber: $O(n^5 \log n)$ = much worse.
    • GA with uniform: $O(2^n)$ = intractable.
    • GA with one point: $O(2^n)$ (without tight linkage).

Challenge
• If we could learn and use context for each position
    • Find nonmisleading statistics.
    • Use those statistics as in probability vector.
• Then we could solve problems decomposable into statistics of order at most k with at most $O(n^2)$ evaluations!
    • And there are many of those problems.

Bayesian Optimization Algorithm (BOA)
Pelikan, Goldberg, & Cantú-Paz (1998)
• Use a Bayesian network (BN) as a model.
• Bayesian network
    • Acyclic directed graph.
    • Nodes are variables (string positions).
    • Conditional dependencies (edges).
    • Conditional independencies (implicit).

Conditional Dependency
[Figure: network in which X depends on Y and Z]
X  Y  Z  P(X | Y, Z)
0  0  0  10%
0  0  1  5%
0  1  0  25%
0  1  1  94%
1  0  0  90%
1  0  1  95%
1  1  0  75%
1  1  1  6%

Bayesian Network (BN)
• Explicit: Conditional dependencies.
• Implicit: Conditional independencies.
• Probability tables

BOA
[Figure: current population → selected population → Bayesian network → new population]

BOA Variation
Two steps
• Learn a Bayesian network (for promising solutions)
• Sample the built Bayesian network (to generate new candidate solutions)
Next
• Brief look at the two steps in BOA

Learning BNs
Two components:
• Scoring metric (to evaluate models).
• Search procedure (to find the best model).
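
To make the "sample the built Bayesian network" step concrete, here is a small sketch built around the table from the Conditional Dependency slide. The conditional probabilities for X are taken from that table; the marginals for the parents Y and Z are not given in the slides and are set to 0.5 purely as an assumption, and all names are hypothetical.

```python
import random
from itertools import product

# P(X = 0 | Y, Z), read from the rows of the table where X = 0.
P_X0_GIVEN_YZ = {(0, 0): 0.10, (0, 1): 0.05, (1, 0): 0.25, (1, 1): 0.94}

# Assumed marginals for the parent nodes (not specified in the slides).
P_Y1 = 0.5
P_Z1 = 0.5


def sample_solution(rng):
    """Sample parents first, then X from its conditional distribution."""
    y = 1 if rng.random() < P_Y1 else 0
    z = 1 if rng.random() < P_Z1 else 0
    x = 0 if rng.random() < P_X0_GIVEN_YZ[(y, z)] else 1
    return x, y, z


def joint_probability(x, y, z):
    """p(x, y, z) = p(y) * p(z) * p(x | y, z) for this small network."""
    p_y = P_Y1 if y else 1 - P_Y1
    p_z = P_Z1 if z else 1 - P_Z1
    p_x0 = P_X0_GIVEN_YZ[(y, z)]
    p_x = p_x0 if x == 0 else 1 - p_x0
    return p_y * p_z * p_x


rng = random.Random(0)
print([sample_solution(rng) for _ in range(5)])
print(sum(joint_probability(*v) for v in product((0, 1), repeat=3)))
```

The factorization over parents is what lets the network encode context for each position without storing a full joint table.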

Learning BNs: Scoring Metrics
Bayesian metrics
• Bayesian-Dirichlet with likelihood equivalence
$$BD(B) = p(B) \prod_{i=1}^{n} \prod_{\pi_i} \frac{\Gamma(m'(\pi_i))}{\Gamma(m'(\pi_i) + m(\pi_i))} \prod_{x_i} \frac{\Gamma(m'(x_i, \pi_i) + m(x_i, \pi_i))}{\Gamma(m'(x_i, \pi_i))}$$
Minimum description length metrics
• Bayesian information criterion (BIC)
$$BIC(B) = \sum_{i=1}^{n} \left( -H(X_i \mid \Pi_i)\, N - 2^{|\Pi_i|} \frac{\log_2 N}{2} \right)$$

Learning BNs: Search Procedure
• Start with an empty network (like prob. vec.).
• Execute primitive operator that improves the metric the most.
• Until no more improvement possible.
• Primitive operators
    • Edge addition
    • Edge removal
    • Edge reversal.

Sampling BNs: PLS
• Probabilistic logic sampling (PLS)
• Two phases
    • Create ancestral ordering of variables: Each variable depends only on predecessors
    • Sample all variables in that order using CPTs: Repeat for each new candidate solution

BOA Theory: Key Components
• Primary target: Scalability
• Population sizing N
    • How large populations for reliable solution?
• Number of generations (iterations) G
    • How many iterations until convergence?
• Overall complexity
    • O(N x G)
    • Overhead: Low-order polynomial in N, G, and n.

BOA Theory: Population Sizing
• Assumptions: n bits, subproblems of order k
• Initial supply (Goldberg): $O(2^k \log n)$
    • Have enough partial sols. to combine.
• Decision making (Harik et al., 1997): $O(n \log n)$
    • Decide well between competing partial sols.
• Drift (Thierens, Goldberg, Pereira, 1998): $O(n)$
    • Don't lose less salient stuff prematurely.
• Model building (Pelikan et al., 2000, 2002): $O(n^{1.55})$
    • Find a good model.

BOA Theory: Num. of Generations
Two bounding cases
• Uniform scaling: $O(n^{0.5})$
    • Subproblems converge in parallel
    • Onemax model (Muehlenbein & Schlierkamp-Voosen, 1993)
• Exponential scaling: $O(n)$
    • Subproblems converge sequentially
    • Domino convergence (Thierens, Goldberg, Pereira, 1998)

Good News
Theory
• Population sizing (Pelikan et al., 2000, 2002): $O(n)$ to $O(n^{1.05})$
    1. Initial supply.
    2. Decision making.
    3. Drift.
    4. Model building.
• Iterations until convergence (Pelikan et al., 2000, 2002): $O(n^{0.5})$ to $O(n)$
    1. Uniform scaling.
    2. Exponential scaling.
BOA solves order-k decomposable problems in $O(n^{1.55})$ to $O(n^2)$ evaluations!

Theory vs. Experiment (5-bit Traps)
[Figure: number of evaluations (100000 to 500000) versus problem size (100 to 250); series: Experiment, Theory]

Additional Plus: Prior Knowledge
• BOA need not know much about problem
    • Only set of solutions + measure (BBO).
• BOA can use prior knowledge
    • High-quality partial or full solutions.
    • Likely or known interactions.
    • Previously learned structures.
    • Problem specific heuristics, search methods.

From Single Level to Hierarchy
• What if problem can't be decomposed like this?
• Inspiration from human problem solving.
• Use hierarchical decomposition
    • Decompose problem on multiple levels.
    • Solutions from lower levels = basic building blocks for constructing solutions on the current level.
    • Bottom-up hierarchical problem solving.

Hierarchical Decomposition
[Figure: hierarchical decomposition of a car into components: engine, fuel system, braking system, valves, electrical system, ignition system]

3 Keys to Hierarchy Success
• Proper decomposition
    • Must decompose problem on each level properly.
• Chunking
    • Must represent & manipulate large order solutions.
• Preservation of alternative solutions
    • Must preserve alternative partial solutions (chunks).

Hierarchical BOA (hBOA)
Pelikan & Goldberg (2001)
• Proper decomposition
    • Use BNs as BOA.
• Chunking
    • Use local structures in BNs.
• Preservation of alternative solutions
    • Restricted tournament replacement (niching).

Local Structures in BNs
• Look at one conditional dependency.
    • $2^k$ probabilities for k parents.
• Why not use more powerful representations for conditional probabilities?
[Figure: network in which X1 depends on X2 and X3]
X2X3  P(X1=0|X2X3)
00    26%
01    44%
10    15%
11    15%
[Figure: decision tree for P(X1=0|X2X3). X2 = 0 and X3 = 0: 26%; X2 = 0 and X3 = 1: 44%; X2 = 1: 15%]

Restricted Tournament Replacement
• Used in hBOA for niching.
• Insert each new candidate solution x like this:
    • Pick random subset of original population.
    • Find solution y most similar to x in the subset.
    • Replace y by x if x is better than y.

hBOA: Scalability
• Solves nearly decomposable and hierarchical problems (Simon, 1968)
• Number of evaluations grows as a low-order polynomial
• Most other methods fail to solve many such problems

Hierarchical Traps
• Traps on multiple levels.
• Blocks of 0s and 1s mapped to form solutions on the next level.
• 3 challenges
    • Many local optima
    • Deception everywhere
    • No single-level decomposability
[Figure: blocks 000 and 111 mapped to symbols on the next level]

Hierarchical Traps
[Figure: number of evaluations (log scale, $10^4$ to $10^6$) versus problem size (27, 81, 243, 729); series: hBOA, fit $O(n^{1.63} \log n)$]

Other Similar Algorithms
• Estimation of distribution algorithms (EDAs)
    • Dynamic branch of evolutionary computation
• Examples:
    • PBIL (Baluja, 1995)
        • Univariate distributions (full independence)
    • COMIT
        • Considers tree models
    • ECGA
        • Groups of variables considered together
    • EBNA (Etxeberria et al., 1999), LFDA (Muhlenbein et al., 1999)
        • Versions of BOA
    • And others…

EDAs: Promising Results
• Artificial classes of problems
• MAXSAT, SAT (Pelikan, 2005).
• Nurse scheduling (Li, Aickelin, 2003)
• Military antenna design (Santarelli et al., 2004)
• Groundwater remediation design (Arst et al., 2004)
• Forest management (Ducheyne et al., 2003)
• Telecommunication network design (Rothlauf, 2002)
• Graph partitioning (Ocenasek, Schwarz, 1999; Muehlenbein, Mahnig, 2002; Baluja, 2004)
• Portfolio management (Lipinski, 2005)
• Quantum excitation chemistry (Sastry et al., 2005)

Current Projects
Algorithm design
• hBOA for computer programs.
• hBOA for geometries (distance/angle-based).
• hBOA for machine learners and data miners.
• hBOA for scheduling and permutation problems.
• Efficiency enhancement for EDAs.
• Multiobjective EDAs.
Applications
• Cluster optimization and spin glasses.
• Data mining.
• Learning classifier systems & neural networks.

Conclusions for Researchers
• Principled design of practical BBOers:
    • Scalability
    • Robustness
    • Solution to broad classes of problems
• Facetwise design and little models
    • Useful for approaching research in evol. comp.
    • Allow creation of practical algorithms & theory

Conclusions for Practitioners
• BOA and hBOA revolutionary optimizers
    • Need no parameters to tune.
    • Need almost no problem specific knowledge.
    • But can incorporate knowledge in many forms.
    • Problem regularities discovered and exploited automatically.
    • Solves broad classes of challenging problems.
    • Even problems unsolvable by any other BBOer.
    • Can deal with noise & multiple objectives.

Book on hBOA
Martin Pelikan (2005). Hierarchical Bayesian optimization algorithm: Toward a new generation of evolutionary algorithms. Springer.

Contact
Martin Pelikan
Dept. of Math. and Computer Science, 320 CCB
University of Missouri at St. Louis
8001 Natural Bridge Rd.
St. Louis, MO 63121
[email protected]
http://www.cs.umsl.edu/~pelikan/

Problem 1: Concatenated Traps
• Partition input binary strings into 5-bit groups.
• Partitions fixed but unknown.
• Each partition contributes the same.
• Contributions sum up.
[Figure: trap(u) versus number of ones u (0 to 5)]

Concatenated 5-bit Traps
[Figure: number of evaluations (100000 to 500000) versus problem size (100 to 250); series: Experiment, Theory]

Spin Glasses: Problem Definition
• 1D, 2D, or 3D grid of spins.
• Each spin can take values +1 or -1.
• Relationships between neighboring spins (i,j) are defined by coupling constants $J_{i,j}$.
• Usually periodic boundary conditions (toroid).
• Task: Find values of spins to minimize the energy
$$E = \sum_{i,j} s_i J_{i,j} s_j$$

Spin Glasses as Constraint Satisfaction
[Figure: grid of spins with = and ≠ constraints between neighboring spins]

Spin Glasses: Problem Difficulty
1D – Easy, set spins sequentially.
2D – Several polynomial methods exist, best is $O(n^{3.5})$
• Exponentially many local optima
• Standard approaches (e.g. simulated annealing, MCMC) fail
3D – NP-complete, even for couplings {-1,0,+1}.
Often random subclasses are considered
• ±J spin glasses: Couplings uniform -1 or +1
• Gaussian spin glasses: Couplings $N(0, \sigma^2)$.

Ising Spin Glasses (2D)
[Figure: number of evaluations (log scale) versus problem size (64 to 400); series: hBOA, fit $O(n^{1.51})$]

Results on 2D Spin Glasses
• Number of evaluations is $O(n^{1.51})$.
• Overall time is $O(n^{3.51})$.
• Compare $O(n^{3.51})$ to $O(n^{3.5})$ for best method (Galluccio & Loebl, 1999)
• Great also on Gaussians.

Ising Spin Glasses (3D)
[Figure: number of evaluations (log scale, $10^3$ to $10^6$) versus problem size (64, 125, 216, 343); series: experimental average, fit $O(n^{3.63})$]

MAXSAT
• Given a CNF formula.
• Find interpretation of Boolean variables that maximizes the number of satisfied clauses.
• Example: $(x_2 \lor x_7 \lor x_5) \land (x_1 \lor x_4 \lor x_3)$

MAXSAT Difficulty
• MAXSAT is NP complete for k-CNF, k>1
• But "random" problems are rather easy for almost any method.
• Many interesting subclasses on SATLIB, e.g.
    • 3-CNF from phase transition (c = 4.3 n)
    • CNFs from other problems (graph coloring, …)

MAXSAT: Random 3CNFs
[Figure]

MAXSAT: Graph Coloring
• 500 variables, 3600 clauses
• From "morphed" graph coloring (Toby Walsh)
#  hBOA+GSAT  WalkSAT
1  1,262,018  > 40 mil.
2  1,099,761  > 40 mil.
3  1,123,012  > 40 mil.
4  1,183,518  > 40 mil.
5  1,324,857  > 40 mil.
6  1,629,295  > 40 mil.

Spin Glass to MAXSAT
• Convert each coupling Jij between spins si and sj into two clauses:
    • Jij = +1: (si sj) (si sj)
    • Jij = -1: (si sj) (si sj)
• Consistent pairs of spins = 2 sat. clauses
• Inconsistent pairs of spins = 1 sat. clause
• MAXSAT solvers perform poorly even in 2D!
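
The clause counts on the Spin Glass to MAXSAT slide (2 satisfied clauses for a consistent pair, 1 for an inconsistent pair) can be checked with a small sketch. It assumes the usual two-clause encodings of "equal" and "different" constraints, with a spin of +1 mapped to True and -1 to False. Which sign of Jij maps to which constraint depends on the sign convention of the energy, so the sketch deliberately does not fix that; all function names are hypothetical.

```python
from itertools import product


def equal_clauses(a, b):
    """Clauses encoding a == b: (a OR NOT b) AND (NOT a OR b)."""
    return [a or not b, (not a) or b]


def different_clauses(a, b):
    """Clauses encoding a != b: (a OR b) AND (NOT a OR NOT b)."""
    return [a or b, (not a) or (not b)]


def satisfied(clauses):
    """Number of satisfied clauses in a list of evaluated clauses."""
    return sum(bool(c) for c in clauses)


for a, b in product((False, True), repeat=2):
    print(a, b,
          satisfied(equal_clauses(a, b)),
          satisfied(different_clauses(a, b)))
```

In both encodings a pair that meets its constraint satisfies both clauses and a pair that violates it satisfies exactly one, so maximizing satisfied clauses is equivalent to maximizing the number of consistent couplings.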