Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Computable Rate of Convergence in Evolutionary Computation* David R. Stark The Johns Hopkins University Applied Physics Laboratory 11100 Johns Hopkins Road Laurel, Maryland 20723-6099 U.S.A. [email protected] James C. Spall The Johns Hopkins University Applied Physics Laboratory 11100 Johns Hopkins Road Laurel, Maryland 20723-6099 U.S.A. [email protected] Abstract. The broad field of evolutionary computation (EC)—including genetic algorithms as a special case—has attracted much attention in the last several decades. Many bold claims have been made about the effectiveness of various EC algorithms. These claims have centered on the efficiency, robustness, and ease of implementation of EC approaches. Unfortunately, there seems to be little theory to support such claims. One key step to formally evaluating or substantiating such claims is to establish rigorous results on the rate of convergence of EC algorithms. This paper presents a computable rate of convergence for a class of ECs that includes the standard genetic algorithm as a special case. Key words: Genetic algorithms; Markov chains; stochastic optimization 1. INTRODUCTION Evolutionary computation (EC) algorithms have attracted much attention for solving search and optimization problems. The fundamental characteristics of EC algorithms are the propagation of a population of potential solutions and the use of some Monte Carlo form of simulated evolution. The selection scheme is such that the most favorable elements of the population tend to survive into the next generation while the unfavorable elements tend to perish. There have been many reported successes of specific EC algorithms, most notably various version of the genetic algorithm (GA), introduced in modern form in Holland (1975). What has been lacking in much EC work, however, is a theoretical foundation for characterizing the performance. This paper develops a method for ISIF © 2002 characterizing the rate of convergence of a class of EC algorithms, including the standard GA. The rate presented here provides a quantifiable measure of the performance of ECs for a range of possible implementations. The basic optimization problem considered here * corresponds to finding an optimal point θ : * θ = arg min L (θ ) (1.1) θ∈D where L(θ) is the loss function to be minimized, D is the domain over which the search will occur, and θ is a pdimensional (say) vector of parameters. Historically, there have been three general approaches in EC, namely evolutionary programming (EP), evolutionary strategies (ES) and the above-mentioned GAs. The principle differences in the three approaches are the selection of evolutionary operators used to perform the search and the computer representation of the candidate solutions. EP uses selection and mutation only to generate new solutions. While both ES and GA use selection, recombination and mutation, recombination is used more extensively in GA. A GA traditionally performs evolutionary operations using binary encoding of the solution space, while EP and ES perform the operations using real-coded solutions. The GA also has a real-coded form and there is some indication that the real-coded GA may be more efficient and provide greater precision than the binary-coded GA. The distinction among the three approaches has begun to blur as new hybrid versions of EC algorithms have arisen. The formal convergence of EC algorithms to the * optimal θ has been considered in a number of references. Eiben, et al. (1991) derived a convergence-in-probability result for a GA with elitist selection using Markov chain 88 theory assuming a finite search space. This result characterized the convergence properties of the GA in terms of the selection, mutation, and recombination probabilities. Rudolph (1994) analyzed the canonical GA without elitist selection in the binary search space. He found that the canonical GA will not, in general, converge to the global optimum, and that convergence for this GA comes only by saving the best solution obtained over the course of the search. For function optimization it makes sense to keep the best solution obtained over the course of the search, so based on the analysis by Rudolph (1994) global convergence is guaranteed. Note that convergence results in a finite domain D based on binary-bit representations are per se of limited value. Because there are a finite number of points to search, naïve approaches such as random search and simple enume ration are also guaranteed to converge. However, since convergence is a precondition for convergence rate calculations, convergence results in a finite search space are not entirely meaningless. Rudolph (1998) summarizes the sufficient conditions on the mutation, recombination, and selection operators for global convergence of EC algorithms in finite search spaces, with a simplified mathematical structure that does not rely on Markov chain theory. A relatively recent survey of EC convergence theory is in Rudolph (1997a). Global convergence results can be given for a broad class of problems, but the same can not be said for convergence rates. The mathematical complexity of analyzing EC convergence rates is significant. Determining specifically how many generations of the population are required in order to ensure a certain error in the solution is apparently an open problem for arbitrary loss functions. Vose (1997, 1999) showed that with an infinite population size, and for every 0 < δ < 1, the number of generations required for the GA to come within a distance δ of θ∗ is O(−log δ). Aside from the infinite population size requirement, this result is not directly usable in our comparison however since it does not give a quantifiable expression for the number of generations required to guarantee that the best population element will be within * some δ distance of θ . Other rate of convergence results exist, but these are generally very restrictive (typically applying when only two of the three standard EC operations—selection, mutation, and recombination—apply and/or when the algorithms are being used on only the simple spherical loss function L(θ) = ||θ|| 2 ). See, for example, Qi and Palmeiri (1994), Beyer (1995), or Rudolph (1997a, 1997b). Recently Leung et. al. (2001) have developed a model for Evolutionary Algorithms that provides convergence and convergence speeds for nonelitist EC algorithms. Their model does not however provide a computable rate of convergence. Our approach differs from those above. We initially cast the chosen EC algorithm as a Markov process. Then using information about the transition probabilities available by user-specified coefficients associated with selection, mutation, and recombination, we develop a steady state distribution for the EC population that applies in the limit as the number of generations increase. A rate at which the finite-generation distribution of the population converges to the stationary distribution is given in Suzuki (1995). Finally, to compensate for the inaccuracy of the stationary distribution in finite samples, we present an adjustment that allows for the approximate calculation of the population distribution at any finite generation. It is expected that such knowledge will provide users key knowledge on the expected performance of a given EC implementation before the user invests significant resources in a practical implementation. It is also expected that this result will allow for a generalization of the algorithm comparison carried out in Spall, et al. (2000). Section 2 establishes some notation associated with our EC setting and presents the matrix of transition probabilities for the simple GA, and presents the steadystate distribution of the GA in terms of the GA parameters. Section 3 presents a path toward using the stationary distribution to form a computable bound on the rate of convergence. The appendix presents some details related to the Markov chain analysis of the GA. 2. GA MARKOV CHAIN BACKGROUND Recall the basic problem in (1.1). While there are numerous variations in the GAs used by practitioners, in the traditional GA the nominally real vector of parameters is converted to its binary equivalent, so that the search space is the collection of binary strings of length l. Thus there are 2l possible strings. Since the search space for a GA is nominally over a finite collection of binary strings, the candidate solutions may also be represented by a collection l of integers, r ∈ 0, 1, 2, ..., 2 − 1 . Each value of r represents a unique label for one of the 2l possible strings. The number of candidate solutions at each generation is called the population size and is denoted by Npop . The total number of possible populations is given by { } + 2l − 1 pop . N= (2.1) ( 2l − 1)! N ! pop An initial population of candidate solutions is selected at random. The basic GA search proceeds as follows: N 1. 89 The fitness of each of the Npop candidate solutions in the population is determined from the fitness function (the 2. 3. 4. 5. loss function is usually transformed into a fitness function). Based on the fitness of the candidate solutions, the parents for the next population are selected. Some form of reproduction among the parents takes place. This is usually called crossover and in its simplest form two parents are selected with some probability for crossover (called the crossover rate) and a random position in the parents’ binary string is chosen. The offspring of the two parents are formed by exchanging the tails of the binary strings at the randomly selected position. Mutation corresponds to flipping of bits of an individual with some small probability (called the mutation rate). The candidates left after these four steps form the gene pool for the next population. The cycle repeats through successive generations until some stopping criterion is reached. The GA may be modeled as a Markov chain so that all of the tools of Markov chain theory may be used to explore the properties of GAs. Nix and Vose (1992) showed that the stochastic transition through these genetic operations can be described completely by the transition matrix for one generation. If we assume that the GA operates on each generation to produce the next, then the N × N Markov Chain transition matrix Q has elements Qi , j = conditional probability that population j is generated from population i where each possible individual l r ∈ 0,1,2,...,2 − 1 is included s(r,i) times in population i. { } As shown by Nix and Vose (1992) the elements of the transition matrix give the probability of producing population j from population i and may be written as a multinomial distribution given by 2l −1 [π ( r )] s ( r , j ) i Qi, j = N pop ! ∏ , s( r, j)! r=0 there is no mechanism for preserving the best solution into the next generation the best solution will typically be lost through the recombination and mutation operations. Rudolph (1998) shows that a GA that ensures that the best solution in the population will survive the selection process will converge globally. Suzuki (1995) derives Q for a modified elitist GA that satisfies the criteria for convergence. In the modified elitist GA, Npop −1 candidate solutions are produced at each generation and the candidate solution with the highest fitness from the previous generation is assured to survive. In the sequel we will be using this form of the GA. The reader should refer to the appendix for a complete description of Q in terms of the GA parameters for the modified elitist algorithm. Of primary interest in analyzing the performance of GA algorithms using Markov chains is the stationary distribution of the GA. Davis (1991) shows that the stationary distribution for the simple GA with both mutation, crossover and selection operators exists, and it is unique. Let q k be a 1 × N row vector having jth component, q k(j) equal to the probability that the kth generation will result in population j. From the definition of the transition matrix, q k = q 0 Qk where q 0 is an initial probability distribution. The stationary distribution of the GA is then given by k q = lim q k = lim q 0Q . (2.3) k→∞ k →∞ From this we see that q is a solution to q = qQ. If we can solve directly for q then we will have the stationary distribution of the GA in terms of the GA parameters, since Q is completely specified in terms of the GA parameters through the πi ( y ) . The method of solving for q k(j) is well known (Iosifescu, 1980, pp. 123-124). q( j ) = (2.2) where πi ( r ) is the probability of producing string r from population i. Clearly, πi (r ) depends upon the algorithm parameters, namely the mutation rate, convergence rate and the type of selection used (which normally depends upon the objective function). The task of computing πi (r ) is formidable for most forms of the GA. Fortunately, however, this problem was solved for the canonical GA by Nix and Vose (1992) using the model of the GA developed from Vose and Liepins (1991). Unfortunately this form of the canonical GA does not converge to the global optimum, as shown by Rudolph (1994). The canonical GA typically finds the global solution at some generation(s), but because Dj (2.4) N −1 ∑ D i =0 i where Dj is the jth cofactor of the main diagonal entries of the matrix I − Q. The cofactors Dj are found by taking the determinant of the matrix obtained from I − Q by removing the jth row and column. 3. THE STATIONARY DISTRIBUTION APPLIED TO CONVERGENCE ANALYSIS The analysis in Sections 2 provides a means for obtaining the stationary distribution for the population elements of a GA. This analysis, however, does not directly answer a question such as “What is the likelihood of having at least one population element correspond to a solution θ in 90 * the search space of interest close to the optimal θ ?” There are two reasons why this question is not directly answered in the previous analysis: (1) The prior analysis provides limiting probabilities for the bit-based representation of the GA, not the corresponding floating point representation that is of direct physical interest in solving problem (1.1) and (2) The stationary probability for one generation does not directly describe the probability of obtaining a “good” solution in at least one generation across a number of iterations. This section addresses these issues in developing an approach that applies in a realistic GA running multiple generations. Before proceeding, note that the use of the stationary distribution is justified under the convergence result in Suzuki (1995). In particular, for the regular GA of interest here, Suzuki (1995) presents results related to the stationary probabilities associated with the likelihood of realizing “good” specific population elements in a population. It is shown that the probability that the population contains the individual with the highest fitness value (corresponding to the θ with the lowest loss value) approaches 1 − O(a n ) where 0 < a < 1. * corresponding to θ ∈ S( θ ) first meets or exceeds 1 − ρ, we have met the requirement for “satisfactory convergence.” Suppose that we count the possible multiple evaluations of the fitness function at a particular bit string that result through the operations of a GA (i.e., the same bit string appears more than once in the course of the GA operations) as separate function evaluations. This is usually the way function evaluations are counted in a GA because it is not typically worth the software overhead to store a function evaluation for future use. Then, the number of function evaluations needed in response to question A above is Npop +(K − 1)×(Npop − 1), where we assume that the initial generation computes Npop fitness values and all subsequent generations compute only Npop − 1 fitness values (due to the preservation of the best value from one generation to the next). 4. ILLUSTRATION A short example will now follow that illustrates the convergence calculation for a modified genetic algorithm. Consider the following loss function from Schwefel (1994): There are obviously many ways one can express the rate of convergence. We will address the rate of convergence by focusing on the question: L (θ ) = − θ sin ( θ ) , θ∈[0,15] as shown in Figure 1. In this example there is a local minimum. The optimal solution is on the boundary at θ?? ?? ?Modified Genetic Algorithms with the following evolutionary parameters will be used to find the minimum of the loss function. Crossover, Mutation, Proportional Selection, and Population Size. With some high probability 1− ρ (ρ a small number), how many L(⋅) function evaluations are needed to * achieve a solution θ lying in some “satisfactory set” S( θ ) * containing θ ? Note that the above question is in terms of the basic parameters θ and associated loss function. Although the GA discussion in Sections 2 and 3 is based on the traditional notions of bit-based population elements and a fitness function, there is no fundamental barrier to transitioning between the formulations. In particular, there is a unique mapping between the possible values for θ and the “labels” r associated with each of the 2l possible bit representations. Given the mapping between θ and r, it is possible to define * a set equivalent to the satisfactory set S( θ ) such that when * r lies in this set, then θ ∈ S( θ ). Let this equivalent set be Sr. Given q k computed according to Section 2, we are in a position to answer question A. First, we identify the population elements that correspond to solutions acceptably * close to θ . Then, for any k, we can compute q k, giving the probabilities associated with the populations containing * those bit-based elements corresponding to θ near θ . When we identify a K such that the sum of the probabilities in q K The convergence rate results for three cases with varying evolutionary parameters are shown in Table 1. It should be noted that computing convergence rates for multidimensional, large population-size evolutionary computation problems using this method is not practical due to the very large transition matrix Q that would result as suggested by the number of possible populations calculation l ( N pop + 2 − 1)! equation N = . A perhaps more intuitive l (2 − 1)! N pop ! estimate of the size of N can be obtained by Stirling’s Approximation as follows: 2 π 1 + N pop l 2 −1 N pop 2l −1 N pop 1 + 2 l − 1 1 l + 2 −1 1/2 N pop 1 . The rate of growth of N is clearly dominated by the first two terms in the expression above. 91 APPENDIX 0 A complete specification for the transition matrix Q for the modified elitist algorithm as detailed in Suzuki (1995) is provided in this appendix. -2 -4 1. -6 -8 -10 -12 0 5 10 2. 15 Figure 1. Loss function for use in example. Local minimum at θ=5; global minimum is at θ=15. 3. Table 1. Convergence Rate Results for Sample Loss Function Probability that the Evolutionary Algorithm has a population that contains the optimal solution. Generation of Evolution Algorithm Crossover Rate=1.0 Mutation Rate= 0.05 Npop = 2 l =6 Crossover Rate=1.0 Mutation Rate= 0.05 Npop = 4 l =4 Crossover Rate=1.0 Mutation Rate= 0.05 Npop = 2 l =4 0 5 10 15 20 30 In the modified elitist strategy, the probability Qk,v of the transition from the population k to the population v is given as 40 Qk ,v 0.0 0.2 0.1 0.5 0.2 0.7 0.3 0.8 0.4 0.9 0.7 1.0 Determine the priority prty(r) = 1, 2, ..., 2l for each possible individual r according to the descending order of the fitness value, Λ(r) and a predetermined tie-breaking rule when more than one individual have the same fitness. The fitness function Λ(r) is the negative of the loss function L(θ), where the fitness function has domain as the set of integers and the domain of the loss function is the set of real numbers. Assign the identification k=1, 2, ..., N to each population according to the ascending order of the minimum number of prty(r* (k)) and to a predetermined tie-breaking rule when more than one population have the same highest priority prty(r* (k)), where r* (k) is the individual with the highest priority included in each population k; and The modified elitist strategy reserves the individual r* (k) with the highest priority prty(r* (k)) in the population till next generation. 2l −1 π ( j , k )Y ( j , v ) ∗ ∗ , if prty ( r ( v )) ≤ prty ( r ( k )) ( N pop − 1) ! ∏ = Y ( j ,v ) ! j= 0 ∗ ∗ 0 , ifprty ( r ( k )) < prty ( r ( v )) 1.0 1.2 where the individual j occurs s(j,k) times in population k and s(j,v) times in population v and ∗ s ( r , v ), if r ≠ r (k ) Y (r , v ) = ∗ s ( r , v ) − 1, if r = r (k ) 2l −12l −1 π( j , k ) = ∑ ∑ ϕ (i ⊕ j, k ) ϕ( h ⊕ j, k ) Ri, h i= 0 h= 0 0.1 0.2 0.3 0.4 0.5 0.7 1.0 ϕ( i, k ) = 92 Λ ( i ) s( i, k ) l 2 −1 ∑ Λ ( h ) s (h ,k ) h= 0 Ri ,h = 1 2 l−i i l− h h (1 − χ) (1 −µ ) µ + (1 − µ) µ l −1 (2 υ −1)⊗h + i − ( 2υ −1)⊗i χ l −1 ∑ µ υ=1 l − i + (2 υ −1)⊗ i − (2υ −1)⊗h 1 + υ υ 2 χ l −1 (2 −1)⊗i +h − ( 2 −1)⊗h + ∑ µ l −1υ=1 l − h + (2υ −1)⊗h − (2 υ −1)⊗i ⋅(1−µ )⋅(1−µ) ComputationConvergence Analysis and Specifications,” IEEE Transactions On Evolutionary Computation, vol., 5 no. 1, February 2001 Nix, A. and Vose, M. (1992), “Modeling Genetic Algorithms with Markov Chains,” Annals of Mathematics and Artificial Intelligence, vol. 5, pp. 79-88. Qi, X. and Palmeiri, F. (1994), “Theoretical Analysis of Evolutionary Algorithms with Infinite Population Size in Continuous Space, Part I: Basic Properties,” IEEE Transactions on Neural Networks, vol. 5, pp.102-119. Rudolph, G. (1994), “Convergence Analysis of Canonical Genetic Algorithms,” IEEE Transactions on Neural Networks, vol. 5, pp. 96-101. where µ is the mutation probability (0 ≤ µ < 12 ) , χ is the crossover probability (0 ≤ χ ≤ 1) , the ⊕ represents the exclusive-or operator for integers, and ⊗ represents the logical-and operator. Integers are to be regarded as bit vectors when occurring in the operation ⋅ , and for a given bit vector r r denotes the sum of the bit vector’s coordinates. REFERENCES Beyer, H.-G. (1995), “Toward a Theory of Evolution Strategies: On the Benefits of Sex—the (µ/ µ,λ) Theory,” Evolutionary Computation, vol. 3, pp. 81-111. Davis, T. (1991), “Toward an Extrapolation of the Simulated Annealing Convergence Theory on to the Simple Genetic Algorithm”, PhD Dissertation, University of Florida Davis, T., and Principe, J. (1993), “A Markov Chain Framework for the Simple Genetic Algorithm,” Evolutionary Computation, vol. 1, no. 3., pp. 269-288. Eiben, A. E., Aarts, E. H. L., and van Hee, K. M. (1991), “Global Convergence of Genetic Algorithms: A Markov Chain Analysis,” in Parallel Problem Solving from Nature (H.-P. Schwefel and R. Männer, eds.), Springer, Berlin and Heidelberg, pp. 4-12. Holland, J. H. (1975), Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor. Iosifescu, M. (1980), Finite Markov Processes and Their Applications, Wiley, Chichester. Rudolph, G. (1996), “Convergence of Evolutionary Algorithms in General Search Spaces,” in Proceedings of the Third IEEE Conference on Evolutionary Computation, pp. 50-54. Rudolph, G. (1997a), Convergence Properties of Evolutionary Algorithms, Kovac, Hamburg Rudolph, G. (1997b), “Convergence Rates of Evolutionary Algorithms for a Class of Convex Objective Functions,” Control and Cybernetics, vol. 26, pp. 375-390. Rudolph, G. (1998), “Finite Markov Chain Results in Evolutionary Computation: A Tour d’Horizon,” Fundamenta Informaticae, vol. 34, pp. 1-22. Schwefel, H.-P. (1995), Evolution and Optimum Seeking, Wiley, New York Spall, J. C., Hill, S. D., and Stark, D. S. (2000), “Some Theoretical Comparisons of Stochastic Optimization Approaches,” in Proceedings of the American Control Conference, Chicago, IL, June 2000, pp. 1904-1908. Suzuki, J. (1995), “A Markov Chain Analysis on Simple Genetic Algorithms,” IEEE Transactions on Systems, Man, and Cybernetics—B, vol. 25, pp. 655-659. Vose, M. and Liepins, G. (1991), “Punctuated Equilibria in Genetic Search”, Complex Systems,vol. 5, pp. 31-44. Vose, M . (1997), “Logarithmic Convergence of Random Heuristic Search,” Evolutionary Computation, vol. 4, pp. 395-404 Vose, M. (1999), The Simple Genetic Algorithm, MIT Press, Cambridge, MA. Leung, K., Duan, Q., Xu, Z. and Wong, C. K. (2001), “A New Model of Simulated Evolutionary 93