GENETIC ALGORITHM USING SAS IML®

David Steenhard, LexisNexis, Louisville, KY
Nan-ting Chou, University of Louisville, Louisville, KY

ABSTRACT

The genetic algorithm is a search technique that borrows ideas from natural evolution to find good solutions to optimization problems. What is particularly appealing about the technique is that it is robust at finding good solutions for a large variety of problems. In this paper we provide an introduction to genetic algorithms and use a genetic algorithm written in SAS IML® to solve a combinatorial optimization problem, the traveling salesman problem. Sufficient details are provided to enable readers to easily use a genetic algorithm written in SAS IML® as a tool for solving additional optimization problems.

INTRODUCTION

A genetic algorithm (GA) is an adaptive search technique based on the principles and mechanisms of natural selection and 'survival of the fittest' from natural evolution. GAs grew out of Holland's 1967 study of adaptation in artificial and natural systems. By simulating natural evolution in this way, a GA can effectively search the problem domain and solve complex problems. Researchers were initially uncomfortable with genetic algorithms because of their dependence on random choices; these choices seemed arbitrary and unpredictable. In the 1970s, however, Holland developed a solid theoretical foundation for the technique. His theory of schemata gives insight into why genetic algorithms work [Goldberg 1989].

The genetic algorithm operates as an iterative procedure on a fixed-size population, or pool, of candidate solutions. The candidate solutions represent an encoding of the problem in a form analogous to the chromosomes of biological systems. Each chromosome represents a possible solution for a given objective function. Associated with each chromosome is a fitness value, which is found by evaluating the chromosome with the objective function.
It is the fitness of a chromosome that determines its ability to survive and produce offspring. The formulation of the problem can be more flexible than with traditional techniques: the constraints and the objective function can be non-linear or discontinuous, and GAs do not find these problems any more difficult than linear or continuous ones. This is what gives GAs their advantage over traditional methods of optimization. By using a heuristic technique like a genetic algorithm, however, you cannot guarantee an optimal solution. Users must often settle for "near optimal" solutions. These solutions, while usually not perfect, can suffice for a broad range of problems. This paper illustrates how a genetic algorithm written in SAS IML® can be used to solve a combinatorial optimization problem (the traveling salesman problem) and outlines the mechanics of a genetic algorithm in detail.

TRAVELING SALESMAN PROBLEM

The Traveling Salesman Problem (TSP) is a well-known combinatorial optimization problem. The TSP has been proven to belong to a large set of problems termed "NP complete" (nondeterministic polynomial). NP-complete problems have no known method of solution better than trying all possibilities. Because such an exhaustive search is generally impractical for more than a few cities, heuristic methods such as genetic algorithms have been applied to find solutions that are acceptable, if not optimal. In this application we used a 15-city tour where the distance measure is the flying distance between the cities (see Appendix B). The following are the cities we used in this application: Atlanta, Anchorage, Chicago, Cleveland, Dallas, Denver, Houston, Kansas City, Los Angeles, Miami, New York City, Phoenix, San Diego, Seattle, and Washington DC. We assigned each city a gene value with a number 1 through 15 (e.g., Atlanta=1, Anchorage=2, ..., Washington DC=15). The basic genetic algorithm relies on four genetic operators: selection, crossover, mutation, and replacement.
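To make the gene encoding and the objective concrete, here is a minimal Python sketch (an illustration, not the paper's SAS IML® code); the three-city coordinate table is hypothetical and merely stands in for the paper's 15-city flying-distance matrix in Appendix B:

```python
import math

# Hypothetical planar coordinates for three cities, for illustration
# only; the paper actually uses a 15x15 matrix of flying distances.
CITIES = {1: (0.0, 0.0), 2: (3.0, 4.0), 3: (6.0, 0.0)}

def tour_distance(tour):
    """Objective function for the TSP: total length of the closed
    path that visits each city in `tour` once and returns home."""
    total = 0.0
    for i, city in enumerate(tour):
        x1, y1 = CITIES[city]
        x2, y2 = CITIES[tour[(i + 1) % len(tour)]]  # wrap to start
        total += math.hypot(x2 - x1, y2 - y1)
    return total
```

A chromosome such as [1, 2, 3] is simply an ordering of the city numbers, and its fitness is derived from this total distance.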
The selection operators use the fitness values to select a portion of the population to be parents for the next generation. Parents are combined using the crossover and mutation operators to produce offspring. This process combines the fittest chromosomes and passes superior genes to the next generation, thus providing new points in the solution space. The replacement operators ensure that the 'least fit' or weakest chromosomes of the population are displaced by more fit chromosomes.

THE BASICS OF A GENETIC ALGORITHM

The smallest unit of a GA is called a gene. A gene represents a unit of information in your problem domain. For example, in the traveling salesman problem a gene would be the name, or a number representation, of a city that you are traveling to. A series of genes, or a chromosome, represents one possible complete solution to the problem. [Figure: a chromosome such as 3|10|2|8|1|5, where each cell is a gene.]

Genetic algorithms differ from other optimization and search procedures in three ways [Goldberg 1989]:

1. GAs search from a population of points, not a single point.
2. GAs use payoff (objective function) information, not derivatives or other auxiliary knowledge.
3. GAs use probabilistic transition rules, not deterministic rules.

Unlike many mathematical techniques, solution times with GAs are usually highly predictable. Also, solution time is usually not radically affected as the problem gets larger, which is not always the case with other traditional techniques.

In order to make use of a chromosome, the GA needs to decode it and determine how good the chromosome's solution is for a particular problem. This operation is carried out by the objective function. For the TSP the goal is to find the minimum-distance closed path that visits each city exactly once, so the objective function is simply the total distance of the trip. A genetic algorithm first creates a random population of chromosomes where each chromosome is unique.
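This initialization step can be sketched in Python (an illustration rather than the paper's SAS IML® code; the function name and the seeded generator are assumptions made for reproducibility):

```python
import random

def random_population(popsize, ncities, rng=random.Random(0)):
    """Build a population of unique chromosomes, each a random
    permutation of the city numbers 1..ncities."""
    population, seen = [], set()
    while len(population) < popsize:
        tour = list(range(1, ncities + 1))
        rng.shuffle(tour)
        if tuple(tour) not in seen:      # keep every chromosome unique
            seen.add(tuple(tour))
            population.append(tour)
    return population
```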
By creating a diverse population instead of a single solution, the GA is trying many solutions at once. After an initial random population is set up, the GA begins an iterative process of refining the initial solutions so that the better ones are more likely to become the top solution. The GA experiments with new solutions by combining and refining the information in the chromosomes using four basic operations: selection, crossover, mutation, and replacement. This combining and refining process continues until the genetic algorithm converges to an optimal value. Appendix A illustrates the complete process of a basic genetic algorithm.

SELECTION METHODS

Selection is the process of choosing two parents from the population for crossing. Determining which individuals should be allowed to interchange their genetic information has great bearing on the rate of convergence specifically and the success of the search in general. There are numerous selection techniques used in genetic algorithm software; the ones we used in the SAS IML® genetic algorithm are roulette, tournament, and remainder stochastic sampling without replacement.

Roulette Selection

Roulette selection is one of the traditional GA selection techniques. The principle of roulette selection is a linear search through a roulette wheel with the slots in the wheel weighted in proportion to each chromosome's fitness value. A target value is generated, which is a random proportion of the sum of the fitness scores in the population. The selection method then iterates through the population until the target value is reached. A fit individual (high fitness score) will have a higher probability of contributing one or more offspring to the next generation. This process is carried out until the desired population size is met. The following SAS IML® code performs roulette selection.
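Before the SAS IML® version, the same weighted cumulative-sum search can be sketched in Python (a simplified illustration; the injectable `rng` parameter is an assumption made for testability):

```python
import random

def roulette_select(fitness, rng=random.random):
    """Walk the cumulative fitness sum until a random target is
    reached; returns the index of the selected chromosome."""
    target = rng() * sum(fitness)
    partial = 0.0
    for i, f in enumerate(fitness):
        partial += f
        if partial >= target:
            return i
    return len(fitness) - 1  # guard against floating-point round-off
```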
   start roulette(mate,popsize,fitness,nchrome,iparent);
      /* Find the cumulative sum of the fitness scores and
         select a member of the population by using a
         weighted (by fitness score) random number. */
      partsum = cusum(fitness);
      rand = ranuni(0)*(sum(fitness));
      k = 0;
      do until((partsum[k,] >= rand) | (k = popsize));
         k = k + 1;
         mate = k;
      end;
   finish;

Tournament Selection

Tournament selection is a method where successive pairs of individuals are randomly drawn from the population. After drawing a pair, the chromosome with the higher fitness score is declared the winner and inserted into the new population, and another pair is drawn. This process continues until the desired population size is reached.

Remainder Stochastic Sampling Without Replacement Selection

For this method the probability of selection is calculated as preselect_i = f_i / Σ_j f_j, where f_i is the fitness score of the ith chromosome. The expected number of copies of each chromosome is then c_i = preselect_i * n, where n is the total number of chromosomes in the population. Each chromosome is allocated to the new population according to the integer part of its c_i value. The fractional parts of the expected numbers are treated as probabilities: one by one, weighted coin tosses (Bernoulli trials) are performed using the fractional parts as success probabilities. This procedure continues until the desired population size is reached.

CROSSOVER OPERATORS

After selection, crossover is performed on the selected parents (chromosomes). Crossover, which occurs in nature, takes two chromosomes and swaps some of their information gene for gene. The resulting chromosomes, called children, have a piece inherited from each of their parents. Applying crossover to a pair of chromosomes proceeds by choosing a random number between 0 and 1 to determine whether they cross over; if this number does not exceed a specified crossover probability, crossover is performed. The crossover probability is usually chosen between 0.6 and 0.8 [Mitchell 1996]. There are several crossover methods in the GA literature [Mitchell 1996, Goldberg 1989], but no single method works best in all situations. The success or failure of a crossover method depends on the complexity of the fitness or objective function, the encoding of the chromosomes, and other details of the GA.

The two crossover methods that we used in the SAS IML® genetic algorithm are fixed-position sequential crossover and fixed-position crossover. Traditional crossover methods will not work for order-based problems such as the traveling salesman problem, since they do not preserve order. The fixed-position sequential and fixed-position crossover operators are similar to the partially matched crossover (PMX) operator. The following descriptions outline the two operators.

Fixed-Position Sequential Crossover

One randomly selected part (consisting of consecutive genes) of the parent 1 chromosome is copied into offspring 2 at exactly the same location; both order and location are preserved. The genes that are not selected from parent 1 are copied from parent 2 to fill out the empty spaces in offspring 2. The order of the genes in parent 2 is preserved, but their exact locations are not. Similar steps are performed to find the chromosome for offspring 1.

Fixed-Position Crossover

The genes of parent 1 are selected at random and copied into offspring 2 at exactly the same locations as they appear in parent 1; thus both the order and the positions of the genes from parent 1 are preserved in offspring 2. The genes that are not selected from parent 1 are copied from parent 2 to fill out the empty spaces in offspring 2. The order of the genes in parent 2 is preserved, but their exact locations are not. Similar steps are performed to find the chromosome for offspring 1. The following SAS IML® code performs fixed-position crossover.

   start xoverfp(pcross,nchrome,ichild,iparent,nchild,j,mate1,mate2);
      * Perform fixed-position crossover between the randomly selected parents;
      pmx1=repeat(0,nchrome,1);
      pmx2=repeat(0,nchrome,1);
      ncross=0;
      if ranuni(0) > pcross then goto next1;
      do n=1 to nchrome;
         if ranuni(0) <= 0.5 then do;
            ncross=ncross+1;
            pmx1[n,]=iparent[n,mate1];
            pmx2[n,]=iparent[n,mate2];
         end;
      end;
      if ncross > 0 then do;
         * Fill out the first child;
         do n=1 to nchrome;
            check=iparent[n,mate2];
            do m=1 to nchrome;
               test=any(pmx1=check);
               if pmx1[m,]=0 & test=0 then do;
                  pmx1[m,]=check;
                  goto found;
               end;
            end;
            found:
         end;
         * If there are two children produced, fill out the second child;
         if nchild=2 then do;
            do n=1 to nchrome;
               check=iparent[n,mate1];
               do m=1 to nchrome;
                  test=any(pmx2=check);
                  if pmx2[m,]=0 & test=0 then do;
                     pmx2[m,]=check;
                     goto found1;
                  end;
               end;
               found1:
            end;
         end;
         ichild[,j]=pmx1;
         if nchild=2 then ichild[,j+1]=pmx2;
      end;
      next1:
      free pmx1 pmx2;
   finish;

MUTATION OPERATOR

Selection according to fitness scores combined with the crossover operators gives genetic algorithms the majority of their processing power; the mutation operator plays a secondary role. Mutation rarely occurs in nature and is the result of miscoded genetic material being passed from a parent to its offspring. This mutation of the chromosome may represent an improvement in the fitness score, or it may cause harmful results. The mutation rate is very low in nature and is usually kept quite low in genetic algorithms; usually the mutation probability is p_m = 1/(population size). The mutation operator selects one gene from an offspring at random and swaps it with another randomly selected gene in the same offspring. [Figure: a mutation swapping two genes of an offspring chromosome.]

REPLACEMENT METHODS

Replacement is the last stage of the cycle. Two parents are selected from a fixed-size population, they cross over forming one or two children, and mutation may occur; however, not all children and parents can return to the population, so some of them must be replaced. The technique used to decide which individuals stay in the population and which are replaced is very important to the convergence of the genetic algorithm. There are numerous replacement techniques used in genetic algorithm software; the ones we use in the SAS IML® genetic algorithm are both-parents and weakest-individual replacement.

Both Parents

Both-parents replacement is simple: the children replace the parents regardless of their fitness scores. Under this method each individual only gets to breed once, which keeps the population and the genetic material moving around.

Weakest Individual

This replacement method replaces the weakest individuals in the population with the children, given that the children are fitter. The parents are included in the search for the weakest individuals. This technique rapidly improves the overall fitness of the population and works very well with large populations.

SEARCH TERMINATION

The termination, or convergence, criterion is what brings the genetic algorithm to a halt. The termination techniques in the SAS IML® genetic algorithm are quite basic, but they work adequately. The following are brief descriptions of the convergence criteria used in the SAS IML® genetic algorithm.

Max Iterations

This convergence criterion compares the current iteration to a specified number of iterations. If the current iteration is less than the requested number of iterations, the algorithm continues; otherwise, it ends execution.
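This check, together with the terminate-upon-convergence criterion described next, can be sketched in Python (an illustration under assumed names, not the paper's SAS IML® code):

```python
def should_stop(iteration, max_iter, best_history, n_converge):
    """Stop when the iteration budget is spent, or when the
    best-of-generation score has been unchanged for the last
    n_converge generations."""
    if iteration >= max_iter:
        return True
    if len(best_history) > n_converge:
        recent = best_history[-(n_converge + 1):]
        return len(set(recent)) == 1     # no improvement observed
    return False
```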
Terminate Upon Convergence

This convergence criterion checks whether the current best-of-generation (best fitness score) is equal to the best-of-generation of the n previous generations. When using terminate upon convergence as a stopping criterion you must specify the number of previous generations against which to compare.

Terminate Upon Convergence of Average

This convergence criterion compares the average fitness score of the current population with that of the best individual in the population. If the population average is within a specified percentage of the best individual's score, the algorithm terminates.

ADVANCED OPERATORS

This section discusses a few advanced operators that have been added to the basic genetic algorithm over time. The operators discussed here are by no means an exhaustive list of the improvements that have been made through the years. The following are the advanced operators we implemented in the SAS IML® genetic algorithm.

Fitness Scaling

Fitness scaling has become a widely accepted practice. Scaling is done to keep appropriate levels of competition throughout the optimization process. Without scaling, early on there is a tendency for a few "outlier" individuals to dominate the selection process. In this case the objective function values, or fitness scores, must be scaled back to prevent takeover of the population by these extraordinary individuals. Later on, when the population has more or less converged, competition among population members is less strong and the experiment tends to wander. In this case the fitness scores must be scaled up to continue to reward the best performers. The most commonly used fitness scaling methods include:

1. Linear scaling
2. Sigma (σ) truncation

Linear scaling calculates a scaled fitness score f' from the raw fitness score f using a linear equation of the form f' = a*f + b. The coefficients a and b are chosen to enforce equality of the raw and scaled average fitness scores and to cause the maximum scaled fitness score to be a specified multiple of the average fitness score. The following SAS IML® code performs linear scaling.

   start lscale(object,fitness,popsize,fmult);
      if fmult <= 1.2 then fmult=1.2;
      avgscr = sum(object)/popsize;
      fmin = min(object);
      fmax = max(object);
      /* Find the slope and the intercept that will be
         used to re-scale the objective scores. */
      if fmax = avgscr then do;
         slope=1;
         interc=0;
      end;
      else do;
         if fmin > (fmult*avgscr - fmax)/(fmult-1) then do;
            delta=fmax - avgscr;
            slope=(fmult-1)*avgscr/delta;
            interc=avgscr*(fmax - fmult*avgscr)/delta;
         end;
         else do;
            delta=avgscr - fmin;
            slope=avgscr/delta;
            interc= -fmin*avgscr/delta;
         end;
      end;
      * Rescale the objective function;
      fitness = interc + slope*object;
      do i=1 to popsize;
         if fitness[i,] < 0 then fitness[i,]=0;
      end;
   finish;

Sigma truncation: Linear scaling works well except when negative fitness scores prevent its use. To prevent this scaling problem, Forrest [Forrest 1985] suggested using population variance information to scale the raw fitness scores as follows:

   f' = f - (f_bar - c*σ)

where f_bar is the average raw fitness score. The constant c is chosen (user specified) as a reasonable multiple of the standard deviation, and negative scaled fitness scores (f' < 0) are set to 0.

Elitism

Elitism is a method that is used to retain the best overall individual at each generation. This individual could otherwise be lost if it were not selected for reproduction, or destroyed by crossover or mutation. Elitism has been shown to improve the performance of genetic algorithms in many different situations.

Overlapping Populations (Steady-State GA)

The steady-state genetic algorithm uses overlapping populations with a user-specified amount of overlap. For each generation, the algorithm creates new offspring that are added to the population, and then the worst individuals (based on fitness scores) are removed in order to return the population to its original size.
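The sigma truncation rule discussed under fitness scaling, f' = f - (mean - c*sigma) with negative results set to 0, can be sketched in Python as follows (an illustration; the default c=2.0 is an assumption, since the paper leaves c user-specified):

```python
import statistics

def sigma_truncation(scores, c=2.0):
    """Forrest's sigma truncation: f' = f - (mean - c*sigma), with
    negative scaled scores clipped to 0."""
    mean = statistics.mean(scores)
    sigma = statistics.pstdev(scores)   # population std. deviation
    return [max(0.0, f - (mean - c * sigma)) for f in scores]
```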
Micro-Genetic Algorithm Operation

The micro-GA is a technique used on a smaller population (a micro population). It first checks this micro population for convergence. If it has converged, the algorithm starts a new micro population containing the best individual and fills in the remaining micro population with new randomly generated chromosomes. To determine whether the micro population has converged, for each individual in the micro population (excluding the best individual) count the number of genes that differ from the best individual; if this total is less than 5% of the total number of genes, the micro population is considered converged. The following SAS IML® code performs the micro-GA operation.

   start gamicro(popsize,nchrome,iparent,ibest,icount,tsp);
      rand = repeat(0,nchrome,1);
      icount = 0;
      do j=1 to popsize;
         do n=1 to nchrome;
            if iparent[n,j] ^= ibest[n,] then icount=icount+1;
         end;
      end;
      /* If icount is less than 5% of the number of genes,
         then consider the population to be converged. */
      diffrac = icount/((popsize-1)*nchrome);
      if diffrac < 0.05 then do;
         /* Add the best individual to the new population and
            randomly generate new parents to fill out the
            new population. */
         do n=1 to nchrome;
            iparent[n,1]=ibest[n,];
         end;
         do j=2 to popsize;
            do n=1 to nchrome;
               rand[n,]=ranuni(0);
            end;
            rcity=rand||tsp[,1];
            tt=rcity;
            rcity[rank(rcity[,1]),]=tt;
            iparent[,j]=rcity[,2];
         end;
      end;
      free tt rcity;
   finish;

RESULTS

To test the effectiveness of the SAS IML® genetic algorithm and to compare the TSP results at different parameter settings, we ran 50 simulations for each of 12 parameter settings. The twelve parameter settings consisted of changing the selection method (roulette, tournament, and remainder stochastic sampling w/o replacement), the crossover operator (fixed-position and fixed-position sequential), and the replacement method (both parents and worst individual). The population size (n=250) and fitness scaling (sigma truncation) were held constant for each run. Table 1 shows the summary statistics of the 600 simulation runs. The optimal solution for this problem is 10,173 miles. The simulations with tournament selection, fixed-position crossover, and both-parents replacement performed the best among the 12 combinations. To compare the statistical significance of the average fitness scores between the combinations we conducted a 3x2x2 factorial design. Table 2 shows the results of this factorial design. From this table we conclude that all the main effects (selection method, crossover operator, and replacement method) are statistically significant. Table 3 displays all the pairwise comparisons using Fisher's LSD. One important note is that the statistical difference between the two replacement methods is probably due to the fact that we used a constant population size (n=250); worst-individual replacement works better with a larger population.

CONCLUSIONS

Genetic algorithms are well suited for solving combinatorial optimization problems like the traveling salesman problem. We obtained good results for a 15-city TSP using a genetic algorithm written in SAS IML®. An efficient genetic algorithm can find good solutions for a large variety of problems and is very helpful when other mathematical techniques break down.

REFERENCES

Bagley, J. D. (1967). The behavior of adaptive systems which employ genetic and correlation algorithms. Dissertation Abstracts International, 28(12), 5106B.

Berry, Michael J. A. and Linoff, Gordon (1997). Data Mining Techniques for Marketing, Sales, and Customer Support. John Wiley & Sons, Inc.

Chambers, Lance (editor) (1995). Practical Handbook of Genetic Algorithms, Applications Volume I. CRC Press.

Dhar, Vasant and Stein, Roger (1997). Seven Methods for Transforming Corporate Data Into Business Intelligence. Prentice Hall Inc.

DeJong, K. A. (1975). An analysis of the behavior of a class of genetic adaptive systems.
Dissertation Abstracts International, 36(10), 5140B.

Forrest, S. (1985). Scaling fitnesses in the genetic algorithm. In Documentation for PRISONERS DILEMMA and NORMS Programs That Use the Genetic Algorithm. Unpublished manuscript.

Gillies, A. M. (1985). Machine learning procedures for generating image domain feature detectors. Unpublished doctoral dissertation, University of Michigan, Ann Arbor.

Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley.

Mitchell, Melanie (1996). An Introduction to Genetic Algorithms. MIT Press.

SAS Institute Inc. (1999). SAS/IML User's Guide, Version 8. Cary, NC: SAS Institute Inc.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

David Steenhard
LexisNexis
7107 Mallgate Place
Louisville, KY 40207
Work Phone: (502) 721-8502
E-mail: david.steenhard@lexis-nexis.com

SAS is a registered trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

SAS IML® GENETIC ALGORITHM PROCESS (APPENDIX A)

Step 1: Randomly initialize a population of parent chromosomes (of size n).
Step 2: Evaluate each chromosome in the population by calculating its fitness score (perform fitness scaling if desired).
Step 3: Select parent chromosomes for reproduction based on their fitness scores.
Step 4: Apply crossover and mutation as the parent chromosomes reproduce.
Step 5: Replace selected members of the population (based on the replacement method) with the new members.
Step 6: If the population has converged or the maximum number of generations has been reached, stop and return the best chromosome (solution set); otherwise return to Step 2.

FLYING DISTANCE BETWEEN 15 U.S.
CITIES (APPENDIX B)

[Table: flying distances in miles between the 15 cities; the matrix values are not legible in this copy.]

For a 15-city tour there are 15! (1,307,674,368,000) possible tours, and 15!/30 (43,589,145,600) of the tours are unique.

SAS IML® GENETIC ALGORITHM RESULTS (APPENDIX C)

Table 1: Genetic Algorithm 15-City Tour Overall Performance

[Table: summary statistics (tour distances and percentage figures) for the 12 parameter settings; the individual values are not legible in this copy.]

Table 2: Full Three-Factor Factorial Design Model

   Source                              DF     Anova SS      Mean Square    F Value    Pr > F
   Selection Method                     2     4,018,841     2,009,420      15.49      0.0000
   Crossover Operator                   1     1,420,872     1,420,872      10.96      0.0010
   Selection*Crossover                  2       367,215       183,608       1.42      0.2436
   Replacement Method                   1     5,310,628     5,310,628      40.95      0.0000
   Selection*Replacement                2       545,910       272,955       2.10      0.1228
   Crossover*Replacement                1         6,977         6,977       0.05      0.8167
   Selection*Crossover*Replacement      2       569,883       284,942       2.20      0.1120
   Error                              588    76,258,973       129,692
   Corrected Total                    599    88,499,299

Table 3: Pairwise Comparisons of GA Parameter Settings

   Comparison                                             Difference       95% Confidence Limits
                                                          Between Means
   Selection Method
     Tournament - Roulette                                   -195.49       -266.48 to -124.49 ***
     Tournament - Remainder Stochastic w/o Replacement       -136.22       -207.21 to  -65.23 ***
     Remainder Stochastic w/o Replacement - Roulette          -59.27       -130.26 to   11.73
   Crossover Operator
     Fixed Point Crossover - Fixed Point Sequential           -97.33       -155.29 to  -39.36 ***
   Replacement Method
     Both Parents - Worst Individual                         -188.16       -246.13 to -130.19 ***

   *** Significant at the 0.05 level.