Hierarchical Bayesian
Optimization Algorithm (hBOA)
Martin Pelikan
University of Missouri at St. Louis
[email protected]
Foreword

Motivation
• Black-box optimization (BBO) problem
  • Set of all potential solutions
  • Performance measure (evaluation procedure)
• Task: Find optimum (best solution)
• Formulation useful: No need for gradient, numerical functions, …
• But many important and tough challenges

This talk
• Combine machine learning and evolutionary computation
• Create practical and powerful optimizers (BOA and hBOA)
Overview
• Black-box optimization (BBO)
• BBO via probabilistic modeling
  • Motivation and examples
  • Bayesian optimization algorithm (BOA)
  • Hierarchical BOA (hBOA)
• Theory and experiment
• Conclusions
Black-box Optimization
• Input
  • What do potential solutions look like?
  • How to evaluate the quality of potential solutions?
• Output
  • Best solution (the optimum)
• Important
  • We don’t know what’s inside the evaluation procedure.
  • Vector and tree representations are common.
  • This talk: binary strings of fixed length.
BBO: Examples
• Atomic cluster optimization
  • Solutions: Vectors specifying positions of all atoms
  • Performance: Lower energy is better
• Telecom network optimization
  • Solutions: Connections between nodes (cities, …)
  • Performance: Satisfy constraints, minimize cost
• Design
  • Solutions: Vectors specifying parameters of the design
  • Performance: Finite element analysis, experiment, …
BBO: Advantages & Difficulties
• Advantages
  • Use the same optimizer for all problems.
  • No need for much prior knowledge.
• Difficulties
  • Many places to go
    • 100-bit strings: 2^100 = 1,267,650,600,228,229,401,496,703,205,376 solutions.
    • Enumeration is not an option.
  • Many places to get stuck
    • Local operators are not an option.
  • Must learn what’s in the box automatically.
  • Noise, multiple objectives, interactive evaluation, …
Typical Black-Box Optimizer
• Sample solutions
• Evaluate sampled solutions
• Learn to sample better
• Repeat the loop: Sample → Evaluate → Learn
Many Ways to Do It
• Hill climber
  • Start with a random solution.
  • Flip the bit that improves the solution most.
  • Finish when no more improvement is possible.
• Simulated annealing
  • Introduce the Metropolis criterion.
• Evolutionary algorithms
  • Inspiration from natural evolution and genetics.
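The hill-climber steps above can be sketched in a few lines of Python; `hill_climb` and its arguments are illustrative names, not code from the talk.

```python
import random

def hill_climb(evaluate, n, seed=None):
    """Bit-flip hill climber: start from a random binary string, repeatedly
    flip the single bit that improves the evaluation most, and stop once no
    single flip improves the solution (a local optimum)."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    best = evaluate(x)
    improved = True
    while improved:
        improved = False
        best_flip, best_val = None, best
        for i in range(n):
            x[i] ^= 1                      # tentatively flip bit i
            val = evaluate(x)
            x[i] ^= 1                      # undo the flip
            if val > best_val:
                best_flip, best_val = i, val
        if best_flip is not None:
            x[best_flip] ^= 1              # commit the best single-bit flip
            best = best_val
            improved = True
    return x, best

# On ONEMAX (count of 1s) every local optimum is the global optimum,
# so the hill climber always reaches the all-ones string.
solution, value = hill_climb(sum, 20, seed=1)
```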
Evolutionary Algorithms
• Evolve a population of candidate solutions.
• Start with a random population.
• Iteration
  • Selection: select promising solutions.
  • Variation: apply crossover and mutation to selected solutions.
  • Replacement: incorporate new solutions into the original population.
Estimation of Distribution Algorithms
• Replace standard variation operators by
  • Building a probabilistic model of promising solutions
  • Sampling the built model to generate new solutions
• Probabilistic model
  • Stores features that make good solutions good
  • Generates new solutions with just those features
EDAs
[Diagram: current population → selected population → probabilistic model → new population, illustrated on a population of 5-bit strings such as 11001, 10101, 01011, 11000]
What Models to Use?
• Our plan
  • Simple example: Probability vector for binary strings
  • Bayesian networks (BOA)
  • Bayesian networks with local structures (hBOA)
Probability Vector
• Baluja (1995)
• Assumes binary strings of fixed length.
• Stores the probability of a 1 in each position.
• New strings are generated with those proportions.
• Example:
  • (0.5, 0.5, …, 0.5) for the uniform distribution
  • (1, 1, …, 1) for generating strings of all 1s
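A minimal probability-vector EDA in this spirit (PBIL/UMDA-style) can be sketched as follows; the function name, population size, and truncation selection are illustrative choices, not specifics from the talk.

```python
import random

def prob_vector_eda(evaluate, n, pop_size=50, generations=30, seed=None):
    """Probability-vector EDA sketch: keep one probability of a 1 per
    position, re-estimate it from the better half of the population,
    and sample the next population from the updated vector."""
    rng = random.Random(seed)
    p = [0.5] * n                                  # uniform distribution
    for _ in range(generations):
        pop = [[1 if rng.random() < p[i] else 0 for i in range(n)]
               for _ in range(pop_size)]
        pop.sort(key=evaluate, reverse=True)
        selected = pop[:pop_size // 2]             # truncation selection
        # New probability of a 1 = its frequency among selected solutions.
        p = [sum(ind[i] for ind in selected) / len(selected)
             for i in range(n)]
    return p

# On ONEMAX the vector should drift toward (1, 1, …, 1).
p = prob_vector_eda(sum, 10, seed=3)
```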
EDA Example: Probability Vector
Current
population
Selected
population
New
population
11001
11001
10101
10101
10101
01011
01011
11101
11000
11000
11001
1.0 0.5 0.5 0.0 1.0
11001
10001
10101
01011
11000
14
Probability Vector Dynamics
• Bits that perform better get more copies.
• And are combined in new ways.
• But the context of each bit is ignored.
• Example problem 1: ONEMAX

  f(X1, X2, …, Xn) = X1 + X2 + … + Xn

• Optimum: 111…1
Probability Vector on ONEMAX
[Plot: proportions of 1s in the probability-vector entries vs. generation (0–50); all entries climb toward 1, the optimum]
Probability Vector on ONEMAX
[Same plot, annotated “Success”: every entry converges to 1, so the probability vector finds the optimum]
Probability Vector: Ideal Scale-up
• O(n log n) evaluations until convergence
  (Harik, Cantú-Paz, Goldberg, & Miller, 1997; Mühlenbein & Schlierkamp-Voosen, 1993)
• Other algorithms
  • Hill climber: O(n log n) (Mühlenbein, 1992)
  • GA with uniform crossover: approx. O(n log n)
  • GA with one-point crossover: slightly slower
When Does Prob. Vector Fail?
• Example problem 2: Concatenated traps
  • Partition the input string into disjoint groups of 5 bits.
  • Each group contributes via a trap function (u = number of ones):

    trap(u) = 5        if u = 5
              4 − u    otherwise

  • Concatenated trap = sum of the single traps
  • Optimum: 111…1
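The trap and concatenated-trap functions above are simple to state in code; a minimal sketch with illustrative names:

```python
def trap5(u):
    """5-bit trap: global optimum at u = 5, deceptive slope toward u = 0."""
    return 5 if u == 5 else 4 - u

def concatenated_traps(x):
    """Sum of trap5 over disjoint 5-bit groups of the binary string x."""
    return sum(trap5(sum(x[i:i + 5])) for i in range((0), len(x), 5))

# The deceptive gradient: 00000 scores 4 per group, while any string one
# bit short of the optimum (u = 4) scores 0.
worst_looking_best = concatenated_traps([1] * 10)   # optimum
deceptive_trap = concatenated_traps([0] * 10)
```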
Trap
[Plot: trap(u) vs. number of ones u (0–5); the function decreases from 4 at u = 0 down to 0 at u = 4, with an isolated global optimum of 5 at u = 5]
Probability Vector on Traps
[Plot: proportions of 1s in the probability-vector entries vs. generation (0–50); the entries drift away from the optimum]
Probability Vector on Traps
[Same plot, annotated “Failure”: the entries converge toward 0, so the probability vector misses the optimum]
Why Failure?
• Onemax:
  • Optimum in 111…1
  • 1 outperforms 0 on average.
• Traps: optimum in 11111, but
  • f(0****) = 2
  • f(1****) = 1.375
• So single bits are misleading.
How to Fix It?
• Consider 5-bit statistics instead of 1-bit ones.
• Then, 11111 would outperform 00000.
• Learn the model
  • Compute p(00000), p(00001), …, p(11111)
• Sample the model
  • Sample 5 bits at a time
  • Generate 00000 with p(00000), 00001 with p(00001), …
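The learn/sample steps above can be sketched as follows, assuming the 5-bit partition is known; the function names and data layout are illustrative.

```python
import random
from collections import Counter

def learn_block_model(selected, block=5):
    """Estimate the joint distribution of each disjoint 5-bit block from the
    selected population -- the 'correct' order-5 statistics for traps."""
    n = len(selected[0])
    models = []
    for start in range(0, n, block):
        counts = Counter(tuple(ind[start:start + block]) for ind in selected)
        total = sum(counts.values())
        models.append({cfg: c / total for cfg, c in counts.items()})
    return models

def sample_block_model(models, rng):
    """Sample one new string block by block with the learned probabilities."""
    x = []
    for dist in models:
        configs, weights = zip(*dist.items())
        x.extend(rng.choices(configs, weights=weights)[0])
    return x

# Tiny demo: the first block is always 11111 in the selected population,
# the second block is all-0s or all-1s with equal probability.
selected = [[1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
models = learn_block_model(selected)
new = sample_block_model(models, random.Random(0))
```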
Correct Model on Traps: Dynamics
[Plot: proportions of 11111 blocks vs. generation (0–50); the proportions climb toward 1, the optimum]
Correct Model on Traps: Dynamics
[Same plot, annotated “Success”: the 5-bit model converges to the optimum]
Good News: Good Stats Work Great!
• Optimum in O(n log n) evaluations.
• Same performance as on onemax!
• Others
  • Hill climber: O(n^5 log n) = much worse.
  • GA with uniform crossover: O(2^n) = intractable.
  • GA with one-point crossover: O(2^n) (without tight linkage).
Challenge
• If we could learn and use the context for each position
  • Find nonmisleading statistics.
  • Use those statistics as in the probability vector.
• Then we could solve problems decomposable into statistics of order at most k with at most O(n^2) evaluations!
  • And there are many of those problems.
Bayesian Optimization Algorithm (BOA)
• Pelikan, Goldberg, & Cantú-Paz (1998)
• Uses a Bayesian network (BN) as the model.
• Bayesian network
  • Acyclic directed graph.
  • Nodes are variables (string positions).
  • Conditional dependencies (edges).
  • Conditional independencies (implicit).
Conditional Dependency
[Diagram: Y → X ← Z]

X  Y  Z  P(X | Y, Z)
0  0  0  10%
0  0  1   5%
0  1  0  25%
0  1  1  94%
1  0  0  90%
1  0  1  95%
1  1  0  75%
1  1  1   6%
Bayesian Network (BN)
• Explicit: Conditional dependencies.
• Implicit: Conditional independencies.
• Probability tables.
BOA
[Diagram: current population → selected population → Bayesian network → new population]
BOA Variation
• Two steps
  • Learn a Bayesian network (for promising solutions)
  • Sample the built Bayesian network (to generate new candidate solutions)
• Next
  • Brief look at the two steps in BOA
Learning BNs
• Two components:
  • Scoring metric (to evaluate models).
  • Search procedure (to find the best model).
Learning BNs: Scoring Metrics
• Bayesian metrics
  • Bayesian-Dirichlet metric with likelihood equivalence:

    BD(B) = p(B) · ∏_{i=1}^{n} ∏_{π_i} [ Γ(m′(π_i)) / Γ(m′(π_i) + m(π_i)) · ∏_{x_i} Γ(m′(x_i, π_i) + m(x_i, π_i)) / Γ(m′(x_i, π_i)) ]

• Minimum description length metrics
  • Bayesian information criterion (BIC):

    BIC(B) = ∑_{i=1}^{n} [ −H(X_i | Π_i) · N − 2^{|Π_i|} · (log₂ N) / 2 ]
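The BIC metric can be computed directly from data. A minimal sketch for binary variables follows; the function name and the parent-list data layout are assumptions, not the talk's notation.

```python
import math
from collections import Counter

def bic_score(data, parents):
    """BIC for a Bayesian network over binary variables:
    BIC(B) = sum_i [ -H(X_i | Pi_i) * N - 2^{|Pi_i|} * log2(N) / 2 ],
    where parents[i] lists the parent indices of variable i and data is a
    list of rows of 0/1 values."""
    N = len(data)
    score = 0.0
    for i, pa in enumerate(parents):
        # Empirical counts of (parent configuration, value of X_i).
        joint = Counter((tuple(row[j] for j in pa), row[i]) for row in data)
        marg = Counter(tuple(row[j] for j in pa) for row in data)
        # Conditional entropy H(X_i | Pi_i) in bits.
        h = -sum(c / N * math.log2(c / marg[key])
                 for (key, _), c in joint.items())
        score += -h * N - (2 ** len(pa)) * math.log2(N) / 2
    return score

# X1 deterministically copies X0, so adding the edge X0 -> X1 removes one
# bit of conditional entropy and should outweigh the complexity penalty.
data = [[0, 0], [0, 0], [1, 1], [1, 1]] * 5
dependent = bic_score(data, parents=[[], [0]])
independent = bic_score(data, parents=[[], []])
```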
Learning BNs: Search Procedure
• Start with an empty network (equivalent to the probability vector).
• Execute the primitive operator that improves the metric the most.
• Repeat until no more improvement is possible.
• Primitive operators
  • Edge addition
  • Edge removal
  • Edge reversal
Sampling BNs: PLS
• Probabilistic logic sampling (PLS)
• Two phases
  • Create an ancestral ordering of the variables:
    • Each variable depends only on its predecessors.
  • Sample all variables in that order using the CPTs:
    • Repeat for each new candidate solution.
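The two PLS phases can be sketched as follows on a toy network; the names and the CPT encoding (a dict from parent-value tuples to P(X_i = 1)) are illustrative assumptions.

```python
import random

def ancestral_order(parents):
    """Topological order of an acyclic network: every variable is placed
    only after all of its parents have been placed."""
    n = len(parents)
    order, placed = [], set()
    while len(order) < n:
        for i in range(n):
            if i not in placed and all(p in placed for p in parents[i]):
                order.append(i)
                placed.add(i)
    return order

def pls_sample(parents, cpt, rng):
    """Probabilistic logic sampling: sample variables in ancestral order,
    looking up P(X_i = 1 | parent values) in the CPTs."""
    x = [None] * len(parents)
    for i in ancestral_order(parents):
        pa_vals = tuple(x[p] for p in parents[i])
        x[i] = 1 if rng.random() < cpt[i][pa_vals] else 0
    return x

# Toy network X0 -> X1: X0 is always 1, X1 deterministically copies X0.
parents = [[], [0]]
cpt = [{(): 1.0}, {(0,): 0.0, (1,): 1.0}]
sample = pls_sample(parents, cpt, random.Random(0))
```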
BOA Theory: Key Components
• Primary target: Scalability
• Population sizing N
  • How large must populations be for a reliable solution?
• Number of generations (iterations) G
  • How many iterations until convergence?
• Overall complexity
  • O(N × G) evaluations
  • Overhead: Low-order polynomial in N, G, and n.
BOA Theory: Population Sizing
• Assumptions: n bits, subproblems of order k
• Initial supply (Goldberg): O(2^k log n)
  • Have enough partial solutions to combine.
• Decision making (Harik et al., 1997): O(√n log n)
  • Decide well between competing partial solutions.
• Drift (Thierens, Goldberg, Pereira, 1998): O(n)
  • Don’t lose less salient partial solutions prematurely.
• Model building (Pelikan et al., 2000, 2002): O(n^1.55)
  • Find a good model.
BOA Theory: Num. of Generations
• Two bounding cases
• Uniform scaling: O(√n)
  • Subproblems converge in parallel.
  • Onemax model (Mühlenbein & Schlierkamp-Voosen, 1993)
• Exponential scaling: O(n)
  • Subproblems converge sequentially.
  • Domino convergence (Thierens, Goldberg, Pereira, 1998)
Good News
• Theory
  • Population sizing (Pelikan et al., 2000, 2002): O(n) to O(n^1.05)
    1. Initial supply.
    2. Decision making.
    3. Drift.
    4. Model building.
  • Iterations until convergence (Pelikan et al., 2000, 2002): O(n^0.5) to O(n)
    1. Uniform scaling.
    2. Exponential scaling.
• BOA solves order-k decomposable problems in O(n^1.55) to O(n^2) evaluations!
Number of Evaluations: Theory vs. Experiment (5-bit Traps)
[Plot: number of evaluations (100,000–500,000) vs. problem size (100–250 bits); the experimental curve closely follows the theoretical prediction]
Additional Plus: Prior Knowledge
• BOA need not know much about the problem
  • Only the set of solutions + performance measure (BBO).
• BOA can use prior knowledge
  • High-quality partial or full solutions.
  • Likely or known interactions.
  • Previously learned structures.
  • Problem-specific heuristics and search methods.
From Single Level to Hierarchy
• What if the problem can’t be decomposed like this?
• Inspiration from human problem solving.
• Use hierarchical decomposition
  • Decompose the problem on multiple levels.
  • Solutions from lower levels = basic building blocks for constructing solutions on the current level.
  • Bottom-up hierarchical problem solving.
Hierarchical Decomposition
[Diagram: Car decomposes into Engine, Fuel system, Braking system, and Electrical system; Engine further decomposes into Valves, …; Electrical system into Ignition system, …]
3 Keys to Hierarchy Success
• Proper decomposition
  • Must decompose the problem on each level properly.
• Chunking
  • Must represent & manipulate large-order solutions.
• Preservation of alternative solutions
  • Must preserve alternative partial solutions (chunks).
Hierarchical BOA (hBOA)
• Pelikan & Goldberg (2001)
• Proper decomposition
  • Use BNs as in BOA.
• Chunking
  • Use local structures in BNs.
• Preservation of alternative solutions
  • Restricted tournament replacement (niching).
Local Structures in BNs
• Look at one conditional dependency.
  • 2^k probabilities for k parents.
• Why not use more powerful representations for conditional probabilities?

[Example: X1 with parents X2, X3 — full conditional probability table]

X2X3  P(X1=0 | X2X3)
00    26 %
01    44 %
10    15 %
11    15 %
Local Structures in BNs
• Look at one conditional dependency.
  • 2^k probabilities for k parents.
• Why not use more powerful representations for conditional probabilities?

[Example: the same table as a decision tree — split on X2: if X2 = 1, P(X1=0) = 15%; if X2 = 0, split on X3: X3 = 0 gives 26%, X3 = 1 gives 44%. The two rows with X2 = 1 are merged into a single leaf.]
Restricted Tournament Replacement
• Used in hBOA for niching.
• Insert each new candidate solution x like this:
  • Pick a random subset of the original population.
  • Find the solution y most similar to x in the subset.
  • Replace y by x if x is better than y.
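The three RTR steps above can be sketched for binary strings as follows; Hamming distance as the similarity measure, the function name, and the in-place update are illustrative choices.

```python
import random

def restricted_tournament_replacement(population, fitness, x, fx,
                                      window_size, rng):
    """RTR niching: pick a random window of the population, find the member
    most similar to the new solution x (Hamming distance), and replace it
    only if x is better. Preserves alternative solutions in other niches."""
    window = rng.sample(range(len(population)), window_size)
    def hamming(a, b):
        return sum(ai != bi for ai, bi in zip(a, b))
    closest = min(window, key=lambda idx: hamming(population[idx], x))
    if fx > fitness[closest]:
        population[closest] = x
        fitness[closest] = fx

# Demo with the window covering the whole population: the new solution
# [0, 0, 1] is closest to [0, 0, 0] and better, so it replaces it, while
# the distant high-fitness member [1, 1, 1] is left untouched.
pop = [[0, 0, 0], [1, 1, 0], [1, 1, 1]]
fit = [0, 2, 3]
restricted_tournament_replacement(pop, fit, [0, 0, 1], 1, 3,
                                  random.Random(0))
```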
hBOA: Scalability
• Solves nearly decomposable and hierarchical problems (Simon, 1968).
• Number of evaluations grows as a low-order polynomial.
• Most other methods fail to solve many such problems.
Hierarchical Traps
• Traps on multiple levels.
• Blocks of 0s and 1s are mapped to form solutions on the next level.
• 3 challenges
  • Many local optima
  • Deception everywhere
  • No single-level decomposability
[Diagram: blocks of 000 and 111 on one level map to single symbols interpreted by traps on the next level]
Hierarchical Traps
[Plot, log–log: number of evaluations (10^4–10^6) vs. problem size (27, 81, 243, 729); hBOA scales as O(n^1.63 log n)]
Other Similar Algorithms
• Estimation of distribution algorithms (EDAs)
  • Dynamic branch of evolutionary computation
• Examples:
  • PBIL (Baluja, 1995)
    • Univariate distributions (full independence)
  • COMIT
    • Considers tree models
  • ECGA
    • Groups of variables considered together
  • EBNA (Etxeberria et al., 1999), LFDA (Mühlenbein et al., 1999)
    • Versions of BOA
  • And others…
EDAs: Promising Results
• Artificial classes of problems
• MAXSAT, SAT (Pelikan, 2005)
• Nurse scheduling (Li & Aickelin, 2003)
• Military antenna design (Santarelli et al., 2004)
• Groundwater remediation design (Arst et al., 2004)
• Forest management (Ducheyne et al., 2003)
• Telecommunication network design (Rothlauf, 2002)
• Graph partitioning (Ocenasek & Schwarz, 1999; Mühlenbein & Mahnig, 2002; Baluja, 2004)
• Portfolio management (Lipinski, 2005)
• Quantum excitation chemistry (Sastry et al., 2005)
Current Projects
• Algorithm design
  • hBOA for computer programs.
  • hBOA for geometries (distance/angle-based).
  • hBOA for machine learners and data miners.
  • hBOA for scheduling and permutation problems.
  • Efficiency enhancement for EDAs.
  • Multiobjective EDAs.
• Applications
  • Cluster optimization and spin glasses.
  • Data mining.
  • Learning classifier systems & neural networks.
Conclusions for Researchers
• Principled design of practical black-box optimizers:
  • Scalability
  • Robustness
  • Solution to broad classes of problems
• Facetwise design and little models
  • Useful for approaching research in evolutionary computation
  • Allow creation of practical algorithms & theory
Conclusions for Practitioners
• BOA and hBOA are revolutionary optimizers
  • Need no parameters to tune.
  • Need almost no problem-specific knowledge.
  • But can incorporate knowledge in many forms.
  • Problem regularities are discovered and exploited automatically.
  • Solve broad classes of challenging problems.
  • Even problems unsolvable by any other black-box optimizer.
  • Can deal with noise & multiple objectives.
Book on hBOA
Martin Pelikan (2005)
Hierarchical Bayesian Optimization Algorithm:
Toward a New Generation of Evolutionary Algorithms
Springer
Contact
Martin Pelikan
Dept. of Math. and Computer Science, 320 CCB
University of Missouri at St. Louis
8001 Natural Bridge Rd.
St. Louis, MO 63121
[email protected]
http://www.cs.umsl.edu/~pelikan/
Problem 1: Concatenated Traps
• Partition input binary strings into 5-bit groups.
• Partitions are fixed but unknown.
• Each partition contributes the same.
• Contributions sum up.
[Plot: trap(u) vs. number of ones u (0–5), decreasing from 4 at u = 0 to 0 at u = 4, with global optimum 5 at u = 5]
Number of Evaluations: Concatenated 5-bit Traps
[Plot: number of evaluations (100,000–500,000) vs. problem size (100–250 bits); experiment matches theory]
Spin Glasses: Problem Definition
• 1D, 2D, or 3D grid of spins.
• Each spin can take values +1 or −1.
• Relationships between neighboring spins (i, j) are defined by coupling constants J_i,j.
• Usually periodic boundary conditions (toroid).
• Task: Find values of the spins to minimize the energy

  E = − ∑_{⟨i,j⟩} s_i J_i,j s_j
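The energy is a single sum over coupled pairs. A sketch with the conventional minus sign (so aligned spins with J = +1 give low energy); the dictionary encoding of the couplings is an assumption for illustration.

```python
def spin_glass_energy(spins, couplings):
    """Energy of a spin configuration: E = -sum over coupled pairs (i, j)
    of s_i * J_ij * s_j. `couplings` maps index pairs (i, j) to J_ij,
    and spins take values +1 or -1."""
    return -sum(spins[i] * J * spins[j] for (i, j), J in couplings.items())

# 1D chain of 3 spins with ferromagnetic couplings (J = +1): the aligned
# configuration minimizes the energy, flipping the middle spin raises it.
couplings = {(0, 1): 1, (1, 2): 1}
aligned = spin_glass_energy([1, 1, 1], couplings)
frustrated = spin_glass_energy([1, -1, 1], couplings)
```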
Spin Glasses as Constraint Satisfaction
[Diagram: grid of spins joined by constraints; each coupling imposes either an equality (=) or an inequality (≠) constraint between neighboring spins]
Spin Glasses: Problem Difficulty
• 1D: Easy, set spins sequentially.
• 2D: Several polynomial methods exist; the best is O(n^3.5).
  • Exponentially many local optima.
  • Standard approaches (e.g. simulated annealing, MCMC) fail.
• 3D: NP-complete, even for couplings in {−1, 0, +1}.
• Often random subclasses are considered
  • ±J spin glasses: Couplings uniform −1 or +1.
  • Gaussian spin glasses: Couplings N(0, σ²).
Ising Spin Glasses (2D)
[Plot, log–log: number of evaluations vs. problem size (64–400 spins); hBOA scales as O(n^1.51)]
Results on 2D Spin Glasses
• Number of evaluations is O(n^1.51).
• Overall time is O(n^3.51).
• Compare O(n^3.51) to O(n^3.5) for the best problem-specific method (Galluccio & Loebl, 1999).
• Great also on Gaussian couplings.
Ising Spin Glasses (3D)
[Plot, log–log: number of evaluations (10^3–10^6) vs. problem size (64–343 spins); experimental average scales as O(n^3.63)]
MAXSAT
• Given a CNF formula.
• Find an interpretation of the Boolean variables that maximizes the number of satisfied clauses.

  Example: (x2 ∨ ¬x7 ∨ x5) ∧ (¬x1 ∨ x4 ∨ x3)
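Evaluating a candidate interpretation is a count over clauses. A sketch using the common signed-integer (DIMACS-style) encoding of literals, which is an assumption for illustration, not the talk's notation:

```python
def satisfied_clauses(assignment, clauses):
    """Count satisfied clauses. A clause is a list of literals; literal k
    means variable |k| (1-based), negated when k < 0. `assignment` is a
    list of booleans indexed by variable number minus one."""
    def lit_true(lit):
        value = assignment[abs(lit) - 1]
        return value if lit > 0 else not value
    return sum(any(lit_true(l) for l in clause) for clause in clauses)

# Small example formula over 7 variables: (x2 v ~x7 v x5) ^ (~x1 v x4 v x3).
clauses = [[2, -7, 5], [-1, 4, 3]]
count = satisfied_clauses([False] * 7, clauses)
```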
MAXSAT Difficulty
• MAXSAT is NP-complete for k-CNF with k > 1.
• But “random” problems are rather easy for almost any method.
• Many interesting subclasses on SATLIB, e.g.
  • 3-CNF from the phase transition (c = 4.3n clauses for n variables)
  • CNFs from other problems (graph coloring, …)
MAXSAT: Random 3CNFs
[Results plot not preserved in the transcript]
MAXSAT: Graph Coloring
• 500 variables, 3600 clauses
• From “morphed” graph coloring (Toby Walsh)

#   hBOA+GSAT   WalkSAT
1   1,262,018   > 40 mil.
2   1,099,761   > 40 mil.
3   1,123,012   > 40 mil.
4   1,183,518   > 40 mil.
5   1,324,857   > 40 mil.
6   1,629,295   > 40 mil.
Spin Glass to MAXSAT
• Convert each coupling J_ij with spins s_i and s_j:
  • J_ij = +1  →  (s_i ∨ ¬s_j) ∧ (¬s_i ∨ s_j)
  • J_ij = −1  →  (s_i ∨ s_j) ∧ (¬s_i ∨ ¬s_j)
• Consistent pairs of spins = 2 satisfied clauses
• Inconsistent pairs of spins = 1 satisfied clause
• MAXSAT solvers perform poorly even in 2D!
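The coupling-to-clauses conversion can be sketched as follows (signed-literal clause encoding and spin-up = true as assumptions): a +1 coupling becomes two clauses enforcing equality, a −1 coupling two clauses enforcing inequality, so a consistent pair satisfies both clauses and an inconsistent pair exactly one.

```python
def coupling_to_clauses(i, j, J):
    """Encode a coupling J_ij over Boolean spin variables i, j (1-based;
    positive literal = spin up). J = +1 wants s_i == s_j, J = -1 wants
    s_i != s_j."""
    if J == +1:
        return [[i, -j], [-i, j]]     # both satisfied iff s_i == s_j
    else:
        return [[i, j], [-i, -j]]     # both satisfied iff s_i != s_j

def num_satisfied(assignment, clauses):
    """Count satisfied clauses under a boolean assignment (1-based vars)."""
    def lit_true(l):
        v = assignment[abs(l) - 1]
        return v if l > 0 else not v
    return sum(any(lit_true(l) for l in c) for c in clauses)

# Equal spins satisfy both clauses of a +1 coupling, unequal spins only one.
eq = coupling_to_clauses(1, 2, +1)
both = num_satisfied([True, True], eq)
one = num_satisfied([True, False], eq)
```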