Download Computable Rate of Convergence in Evolutionary Computation

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Pattern recognition wikipedia , lookup

Hidden Markov model wikipedia , lookup

Gene expression programming wikipedia , lookup

Multi-armed bandit wikipedia , lookup

Genetic algorithm wikipedia , lookup

Transcript
Computable Rate of Convergence in Evolutionary
Computation*
David R. Stark
The Johns Hopkins University
Applied Physics Laboratory
11100 Johns Hopkins Road
Laurel, Maryland 20723-6099 U.S.A.
[email protected]
James C. Spall
The Johns Hopkins University
Applied Physics Laboratory
11100 Johns Hopkins Road
Laurel, Maryland 20723-6099 U.S.A.
[email protected]
Abstract. The broad field of evolutionary computation
(EC)—including genetic algorithms as a special case—has
attracted much attention in the last several decades. Many
bold claims have been made about the effectiveness of
various EC algorithms. These claims have centered on the
efficiency, robustness, and ease of implementation of EC
approaches. Unfortunately, there seems to be little theory
to support such claims. One key step to formally evaluating
or substantiating such claims is to establish rigorous results
on the rate of convergence of EC algorithms. This paper
presents a computable rate of convergence for a class of
ECs that includes the standard genetic algorithm as a
special case.
Key words: Genetic algorithms; Markov chains; stochastic
optimization
1. INTRODUCTION
Evolutionary computation (EC) algorithms have
attracted much attention for solving search and optimization
problems. The fundamental characteristics of EC
algorithms are the propagation of a population of potential
solutions and the use of some Monte Carlo form of
simulated evolution. The selection scheme is such that the
most favorable elements of the population tend to survive
into the next generation while the unfavorable elements tend
to perish.
There have been many reported successes of
specific EC algorithms, most notably various version of the
genetic algorithm (GA), introduced in modern form in
Holland (1975). What has been lacking in much EC work,
however, is a theoretical foundation for characterizing the
performance. This paper develops a method for
ISIF © 2002
characterizing the rate of convergence of a class of EC
algorithms, including the standard GA. The rate presented
here provides a quantifiable measure of the performance of
ECs for a range of possible implementations.
The basic optimization problem considered here
*
corresponds to finding an optimal point θ :
*
θ = arg min L (θ )
(1.1)
θ∈D
where L(θ) is the loss function to be minimized, D is the
domain over which the search will occur, and θ is a pdimensional (say) vector of parameters.
Historically, there have been three general
approaches in EC, namely evolutionary programming (EP),
evolutionary strategies (ES) and the above-mentioned GAs.
The principle differences in the three approaches are the
selection of evolutionary operators used to perform the
search and the computer representation of the candidate
solutions. EP uses selection and mutation only to generate
new solutions. While both ES and GA use selection,
recombination and mutation, recombination is used more
extensively in GA. A GA traditionally performs
evolutionary operations using binary encoding of the
solution space, while EP and ES perform the operations
using real-coded solutions. The GA also has a real-coded
form and there is some indication that the real-coded GA
may be more efficient and provide greater precision than the
binary-coded GA. The distinction among the three
approaches has begun to blur as new hybrid versions of EC
algorithms have arisen.
The formal convergence of EC algorithms to the
*
optimal θ has been considered in a number of references.
Eiben, et al. (1991) derived a convergence-in-probability
result for a GA with elitist selection using Markov chain
88
theory assuming a finite search space. This result
characterized the convergence properties of the GA in terms
of the selection, mutation, and recombination probabilities.
Rudolph (1994) analyzed the canonical GA without elitist
selection in the binary search space. He found that the
canonical GA will not, in general, converge to the global
optimum, and that convergence for this GA comes only by
saving the best solution obtained over the course of the
search. For function optimization it makes sense to keep the
best solution obtained over the course of the search, so
based on the analysis by Rudolph (1994) global
convergence is guaranteed. Note that convergence results in
a finite domain D based on binary-bit representations are per
se of limited value. Because there are a finite number of
points to search, naïve approaches such as random search
and simple enume ration are also guaranteed to converge.
However, since convergence is a precondition for
convergence rate calculations, convergence results in a
finite search space are not entirely meaningless. Rudolph
(1998) summarizes the sufficient conditions on the
mutation, recombination, and selection operators for global
convergence of EC algorithms in finite search spaces, with a
simplified mathematical structure that does not rely on
Markov chain theory. A relatively recent survey of EC
convergence theory is in Rudolph (1997a).
Global convergence results can be given for a
broad class of problems, but the same can not be said for
convergence rates. The mathematical complexity of
analyzing EC convergence rates is significant. Determining
specifically how many generations of the population are
required in order to ensure a certain error in the solution is
apparently an open problem for arbitrary loss functions.
Vose (1997, 1999) showed that with an infinite population
size, and for every 0 < δ < 1, the number of generations
required for the GA to come within a distance δ of θ∗ is
O(−log δ). Aside from the infinite population size
requirement, this result is not directly usable in our
comparison however since it does not give a quantifiable
expression for the number of generations required to
guarantee that the best population element will be within
*
some δ distance of θ . Other rate of convergence results
exist, but these are generally very restrictive (typically
applying when only two of the three standard EC
operations—selection, mutation, and recombination—apply
and/or when the algorithms are being used on only the
simple spherical loss function L(θ) = ||θ|| 2 ). See, for
example, Qi and Palmeiri (1994), Beyer (1995), or Rudolph
(1997a, 1997b). Recently Leung et. al. (2001) have
developed a model for Evolutionary Algorithms that
provides convergence and convergence speeds for nonelitist EC algorithms. Their model does not however
provide a computable rate of convergence.
Our approach differs from those above. We
initially cast the chosen EC algorithm as a Markov process.
Then using information about the transition probabilities
available by user-specified coefficients associated with
selection, mutation, and recombination, we develop a steady
state distribution for the EC population that applies in the
limit as the number of generations increase. A rate at which
the finite-generation distribution of the population
converges to the stationary distribution is given in Suzuki
(1995). Finally, to compensate for the inaccuracy of the
stationary distribution in finite samples, we present an
adjustment that allows for the approximate calculation of
the population distribution at any finite generation. It is
expected that such knowledge will provide users key
knowledge on the expected performance of a given EC
implementation before the user invests significant resources
in a practical implementation. It is also expected that this
result will allow for a generalization of the algorithm
comparison carried out in Spall, et al. (2000).
Section 2 establishes some notation associated with
our EC setting and presents the matrix of transition
probabilities for the simple GA, and presents the steadystate distribution of the GA in terms of the GA parameters.
Section 3 presents a path toward using the stationary
distribution to form a computable bound on the rate of
convergence. The appendix presents some details related to
the Markov chain analysis of the GA.
2. GA MARKOV CHAIN BACKGROUND
Recall the basic problem in (1.1). While there are numerous
variations in the GAs used by practitioners, in the traditional
GA the nominally real vector of parameters is converted to
its binary equivalent, so that the search space is the
collection of binary strings of length l. Thus there are 2l
possible strings. Since the search space for a GA is
nominally over a finite collection of binary strings, the
candidate solutions may also be represented by a collection
l
of integers, r ∈ 0, 1, 2, ..., 2 − 1 . Each value of r
represents a unique label for one of the 2l possible strings.
The number of candidate solutions at each generation is
called the population size and is denoted by Npop . The total
number of possible populations is given by
{
}
+ 2l − 1
pop
.
N=
(2.1)
( 2l − 1)! N
!
pop
An initial population of candidate solutions is
selected at random. The basic GA search proceeds as
follows:
N
1.
89
The fitness of each of the Npop candidate solutions in the
population is determined from the fitness function (the
2.
3.
4.
5.
loss function is usually transformed into a fitness
function).
Based on the fitness of the candidate solutions, the
parents for the next population are selected.
Some form of reproduction among the parents takes
place. This is usually called crossover and in its
simplest form two parents are selected with some
probability for crossover (called the crossover rate) and
a random position in the parents’ binary string is
chosen. The offspring of the two parents are formed by
exchanging the tails of the binary strings at the
randomly selected position.
Mutation corresponds to flipping of bits of an
individual with some small probability (called the
mutation rate).
The candidates left after these four steps form the gene
pool for the next population.
The cycle repeats through successive generations until some
stopping criterion is reached.
The GA may be modeled as a Markov chain so that
all of the tools of Markov chain theory may be used to
explore the properties of GAs. Nix and Vose (1992)
showed that the stochastic transition through these genetic
operations can be described completely by the transition
matrix for one generation. If we assume that the GA
operates on each generation to produce the next, then the
N × N Markov Chain transition matrix Q has elements
Qi , j = conditional probability that population j is generated
from population i where each possible individual
l
r ∈ 0,1,2,...,2 − 1 is included s(r,i) times in population i.
{
}
As shown by Nix and Vose (1992) the elements of
the transition matrix give the probability of producing
population j from population i and may be written as a
multinomial distribution given by
2l −1 [π ( r )] s ( r , j )
i
Qi, j = N pop ! ∏
,
s( r, j)!
r=0
there is no mechanism for preserving the best solution into
the next generation the best solution will typically be lost
through the recombination and mutation operations.
Rudolph (1998) shows that a GA that ensures that the best
solution in the population will survive the selection process
will converge globally. Suzuki (1995) derives Q for a
modified elitist GA that satisfies the criteria for
convergence. In the modified elitist GA, Npop −1 candidate
solutions are produced at each generation and the candidate
solution with the highest fitness from the previous
generation is assured to survive. In the sequel we will be
using this form of the GA. The reader should refer to the
appendix for a complete description of Q in terms of the GA
parameters for the modified elitist algorithm.
Of primary interest in analyzing the performance of
GA algorithms using Markov chains is the stationary
distribution of the GA. Davis (1991) shows that the
stationary distribution for the simple GA with both
mutation, crossover and selection operators exists, and it is
unique. Let q k be a 1 × N row vector having jth
component, q k(j) equal to the probability that the kth
generation will result in population j. From the definition
of the transition matrix, q k = q 0 Qk where q 0 is an initial
probability distribution.
The stationary distribution of the GA is then given
by
k
q = lim q k = lim q 0Q .
(2.3)
k→∞
k →∞
From this we see that q is a solution to q = qQ. If we can
solve directly for q then we will have the stationary
distribution of the GA in terms of the GA parameters, since
Q is completely specified in terms of the GA parameters
through the πi ( y ) . The method of solving for q k(j) is well
known (Iosifescu, 1980, pp. 123-124).
q( j ) =
(2.2)
where πi ( r ) is the probability of producing string r from
population i.
Clearly, πi (r ) depends upon the algorithm
parameters, namely the mutation rate, convergence rate and
the type of selection used (which normally depends upon the
objective function). The task of computing πi (r ) is
formidable for most forms of the GA. Fortunately,
however, this problem was solved for the canonical GA by
Nix and Vose (1992) using the model of the GA developed
from Vose and Liepins (1991). Unfortunately this form of
the canonical GA does not converge to the global optimum,
as shown by Rudolph (1994). The canonical GA typically
finds the global solution at some generation(s), but because
Dj
(2.4)
N −1
∑ D
i =0 i
where Dj is the jth cofactor of the main diagonal entries of
the matrix I − Q. The cofactors Dj are found by taking the
determinant of the matrix obtained from I − Q by
removing the jth row and column.
3. THE STATIONARY DISTRIBUTION
APPLIED TO CONVERGENCE ANALYSIS
The analysis in Sections 2 provides a means for
obtaining the stationary distribution for the population
elements of a GA. This analysis, however, does not directly
answer a question such as “What is the likelihood of having
at least one population element correspond to a solution θ in
90
*
the search space of interest close to the optimal θ ?” There
are two reasons why this question is not directly answered
in the previous analysis: (1) The prior analysis provides
limiting probabilities for the bit-based representation of the
GA, not the corresponding floating point representation that
is of direct physical interest in solving problem (1.1) and (2)
The stationary probability for one generation does not
directly describe the probability of obtaining a “good”
solution in at least one generation across a number of
iterations. This section addresses these issues in developing
an approach that applies in a realistic GA running multiple
generations.
Before proceeding, note that the use of the
stationary distribution is justified under the convergence
result in Suzuki (1995). In particular, for the regular GA of
interest here, Suzuki (1995) presents results related to the
stationary probabilities associated with the likelihood of
realizing “good” specific population elements in a
population. It is shown that the probability that the
population contains the individual with the highest fitness
value (corresponding to the θ with the lowest loss value)
approaches 1 − O(a n ) where 0 < a < 1.
*
corresponding to θ ∈ S( θ ) first meets or exceeds 1 − ρ, we
have met the requirement for “satisfactory convergence.”
Suppose that we count the possible multiple evaluations of
the fitness function at a particular bit string that result
through the operations of a GA (i.e., the same bit string
appears more than once in the course of the GA operations)
as separate function evaluations. This is usually the way
function evaluations are counted in a GA because it is not
typically worth the software overhead to store a function
evaluation for future use. Then, the number of function
evaluations needed in response to question A above is
Npop +(K − 1)×(Npop − 1), where we assume that the initial
generation computes Npop fitness values and all subsequent
generations compute only Npop − 1 fitness values (due to the
preservation of the best value from one generation to the
next).
4. ILLUSTRATION
A short example will now follow that illustrates the
convergence calculation for a modified genetic algorithm.
Consider the following loss function from Schwefel (1994):
There are obviously many ways one can express
the rate of convergence. We will address the rate of
convergence by focusing on the question:
L (θ ) = − θ sin
( θ ) , θ∈[0,15] as shown in Figure 1. In
this example there is a local minimum. The optimal
solution is on the boundary at θ?? ?? ?Modified Genetic
Algorithms with the following evolutionary parameters will
be used to find the minimum of the loss function.
Crossover, Mutation, Proportional Selection, and Population
Size.
With some high probability 1− ρ (ρ a small
number), how many L(⋅) function evaluations are needed to
*
achieve a solution θ lying in some “satisfactory set” S( θ )
*
containing θ ?
Note that the above question is in terms of the basic
parameters θ and associated loss function. Although the GA
discussion in Sections 2 and 3 is based on the traditional
notions of bit-based population elements and a fitness
function, there is no fundamental barrier to transitioning
between the formulations. In particular, there is a unique
mapping between the possible values for θ and the “labels”
r associated with each of the 2l possible bit representations.
Given the mapping between θ and r, it is possible to define
*
a set equivalent to the satisfactory set S( θ ) such that when
*
r lies in this set, then θ ∈ S( θ ). Let this equivalent set be
Sr.
Given q k computed according to Section 2, we are
in a position to answer question A. First, we identify the
population elements that correspond to solutions acceptably
*
close to θ . Then, for any k, we can compute q k, giving the
probabilities associated with the populations containing
*
those bit-based elements corresponding to θ near θ . When
we identify a K such that the sum of the probabilities in q K
The convergence rate results for three cases with
varying evolutionary parameters are shown in Table 1. It
should be noted that computing convergence rates for multidimensional, large population-size evolutionary
computation problems using this method is not practical due
to the very large transition matrix Q that would result as
suggested by the number of possible populations calculation
l
( N pop + 2 − 1)!
equation N =
. A perhaps more intuitive
l
(2 − 1)! N pop !
estimate of the size of N can be obtained by Stirling’s
Approximation as follows:

2 π 1 +

N pop
l
2 −1 
N pop


2l −1
 N pop 
1 + 2 l − 1 


 1
 l +
2 −1
1/2


N pop 
1
.
The rate of growth of N is clearly dominated by the first two
terms in the expression above.
91
APPENDIX
0
A complete specification for the transition matrix Q for
the modified elitist algorithm as detailed in Suzuki (1995)
is provided in this appendix.
-2
-4
1.
-6
-8
-10
-12
0
5
10
2.
15
Figure 1. Loss function for use in example. Local
minimum at θ=5; global minimum is at θ=15.
3.
Table 1. Convergence Rate Results for Sample Loss
Function
Probability that the Evolutionary
Algorithm has a population that
contains the optimal solution.
Generation
of Evolution
Algorithm
Crossover
Rate=1.0
Mutation
Rate= 0.05
Npop = 2
l =6
Crossover
Rate=1.0
Mutation
Rate= 0.05
Npop = 4
l =4
Crossover
Rate=1.0
Mutation
Rate= 0.05
Npop = 2
l =4
0
5
10
15
20
30
In the modified elitist strategy, the probability Qk,v
of the transition from the population k to the population v is
given as
40
Qk ,v
0.0
0.2
0.1
0.5
0.2
0.7
0.3
0.8
0.4
0.9
0.7
1.0
Determine the priority prty(r) = 1, 2, ..., 2l for each
possible individual r according to the descending
order of the fitness value, Λ(r) and a predetermined
tie-breaking rule when more than one individual have
the same fitness. The fitness function Λ(r) is the
negative of the loss function L(θ), where the fitness
function has domain as the set of integers and the
domain of the loss function is the set of real numbers.
Assign the identification k=1, 2, ..., N to each
population according to the ascending order of the
minimum number of prty(r* (k)) and to a
predetermined tie-breaking rule when more than one
population have the same highest priority prty(r* (k)),
where r* (k) is the individual with the highest priority
included in each population k; and
The modified elitist strategy reserves the individual
r* (k) with the highest priority prty(r* (k)) in the
population till next generation.
2l −1 π ( j , k )Y ( j , v )

∗
∗
, if prty ( r ( v )) ≤ prty ( r ( k ))
 ( N pop − 1) ! ∏
=
Y ( j ,v ) !
j= 0
∗
∗
0 ,
ifprty ( r ( k )) < prty ( r ( v ))

1.0
1.2
where the individual j occurs s(j,k) times in population k
and s(j,v) times in population v and
∗
 s ( r , v ),
if r ≠ r (k )
Y (r , v ) = 
∗
 s ( r , v ) − 1, if r = r (k )
2l −12l −1
π( j , k ) = ∑ ∑ ϕ (i ⊕ j, k ) ϕ( h ⊕ j, k ) Ri, h
i= 0 h= 0
0.1
0.2
0.3
0.4
0.5
0.7
1.0
ϕ( i, k ) =
92
Λ ( i ) s( i, k )
l
2 −1
∑ Λ ( h ) s (h ,k )
h= 0
Ri ,h =
1
2
l−i i
l− h h 
(1 − χ) (1 −µ )
µ + (1 − µ)
µ



l −1 (2 υ −1)⊗h + i − ( 2υ −1)⊗i
 χ
 l −1 ∑ µ
 υ=1
l − i + (2 υ −1)⊗ i − (2υ −1)⊗h
1 
+
υ
υ
2
χ l −1 (2 −1)⊗i +h − ( 2 −1)⊗h
+
∑ µ

l −1υ=1


l − h + (2υ −1)⊗h − (2 υ −1)⊗i
⋅(1−µ )⋅(1−µ)













ComputationConvergence Analysis and Specifications,”
IEEE Transactions On Evolutionary Computation, vol., 5
no. 1, February 2001
Nix, A. and Vose, M. (1992), “Modeling Genetic
Algorithms with Markov Chains,” Annals of Mathematics
and Artificial Intelligence, vol. 5, pp. 79-88.
Qi, X. and Palmeiri, F. (1994), “Theoretical Analysis of
Evolutionary Algorithms with Infinite Population Size in
Continuous Space, Part I: Basic Properties,” IEEE
Transactions on Neural Networks, vol. 5, pp.102-119.
Rudolph, G. (1994), “Convergence Analysis of Canonical
Genetic Algorithms,” IEEE Transactions on Neural
Networks, vol. 5, pp. 96-101.
where µ is the mutation probability (0 ≤ µ < 12 ) , χ is the
crossover probability (0 ≤ χ ≤ 1) , the ⊕ represents the
exclusive-or operator for integers, and ⊗ represents the
logical-and operator. Integers are to be regarded as bit
vectors when occurring in
the operation
⋅ , and for a given bit vector r
r denotes the sum of the bit vector’s
coordinates.
REFERENCES
Beyer, H.-G. (1995), “Toward a Theory of Evolution
Strategies: On the Benefits of Sex—the (µ/ µ,λ) Theory,”
Evolutionary Computation, vol. 3, pp. 81-111.
Davis, T. (1991), “Toward an Extrapolation of the
Simulated Annealing Convergence Theory on to the Simple
Genetic Algorithm”, PhD Dissertation, University of Florida
Davis, T., and Principe, J. (1993), “A Markov Chain
Framework for the Simple Genetic Algorithm,”
Evolutionary Computation, vol. 1, no. 3., pp. 269-288.
Eiben, A. E., Aarts, E. H. L., and van Hee, K. M. (1991),
“Global Convergence of Genetic Algorithms: A Markov
Chain Analysis,” in Parallel Problem Solving from Nature
(H.-P. Schwefel and R. Männer, eds.), Springer, Berlin and
Heidelberg, pp. 4-12.
Holland, J. H. (1975), Adaptation in Natural and Artificial
Systems, University of Michigan Press, Ann Arbor.
Iosifescu, M. (1980), Finite Markov Processes and Their
Applications, Wiley, Chichester.
Rudolph, G. (1996), “Convergence of Evolutionary
Algorithms in General Search Spaces,” in Proceedings of
the Third IEEE Conference on Evolutionary Computation,
pp. 50-54.
Rudolph, G. (1997a), Convergence Properties of
Evolutionary Algorithms, Kovac, Hamburg
Rudolph, G. (1997b), “Convergence Rates of Evolutionary
Algorithms for a Class of Convex Objective Functions,”
Control and Cybernetics, vol. 26, pp. 375-390.
Rudolph, G. (1998), “Finite Markov Chain Results in
Evolutionary Computation: A Tour d’Horizon,”
Fundamenta Informaticae, vol. 34, pp. 1-22.
Schwefel, H.-P. (1995), Evolution and Optimum Seeking,
Wiley, New York
Spall, J. C., Hill, S. D., and Stark, D. S. (2000), “Some
Theoretical Comparisons of Stochastic Optimization
Approaches,” in Proceedings of the American Control
Conference, Chicago, IL, June 2000, pp. 1904-1908.
Suzuki, J. (1995), “A Markov Chain Analysis on Simple
Genetic Algorithms,” IEEE Transactions on Systems, Man,
and Cybernetics—B, vol. 25, pp. 655-659.
Vose, M. and Liepins, G. (1991), “Punctuated Equilibria in
Genetic Search”, Complex Systems,vol. 5, pp. 31-44.
Vose, M . (1997), “Logarithmic Convergence of Random
Heuristic Search,” Evolutionary Computation, vol. 4, pp.
395-404
Vose, M. (1999), The Simple Genetic Algorithm, MIT Press,
Cambridge, MA.
Leung, K., Duan, Q., Xu, Z. and Wong, C. K. (2001), “A
New Model of Simulated Evolutionary
93