Assignment 1

Submitters: Arie Kozak 314346024, Amir Patoka 041857178

Question 1

Problem description

The goal is to solve the Ackley problem [ACK87], which is defined as finding the minimum point of the following function for n = 3:

F(x) = -20 * exp(-0.2 * sqrt((1 / n) * sum(Xi ^ 2))) - exp((1 / n) * sum(cos(2 * pi * Xi))) + 20 + e

where both sums run over i = 1..n. The search is performed in (-32.768, 32.768) for each Xi.

Genome Representation

As defined in genetic algorithms, the genome is simply a bit string. I used the Java type "String" filled with "0" or "1" characters for this. The genome encodes the phenotype, which is a vector of 3 numbers of type double (the Xi of the function), as a concatenation of 3 bit strings of equal length. Each bit string represents one real number, calculated in the following way:

X = N * (B - A) / (2 ^ L - 1) + A

Where:
X: the real number in the phenotype
N: the value of the bit string as a (natural) binary number
A: -32.768
B: 32.768
L: the length of the bit string

Fitness calculation

The fitness is a real number (double in Java), calculated in the following way:

Fitness = 20 + e - F(x)

This keeps the fitness always positive (yes, it's not very important), and a higher fitness indicates a better candidate because it means a lower value of F(x).

Process of work, conclusions and results

Results are written in the following format: first the input parameters of the algorithm, next the plot of best and average fitness, and finally the result. Representation length is the number of bits for each X in the phenotype (so the length of the genome is 3 times this number). Pm and Pc are the mutation and cross-over probabilities as defined in genetic algorithms. Average fitness is the average fitness of the last generation. The phenotype is presented as a vector of 3 numbers in "[]", and the "f value" is F(best phenotype).

I started with a low representation length; 16 seemed to be sufficient.
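The decoding and fitness formulas of this question can be sketched in Java as follows. This is only an illustration and not the code of AckleyProblem.java: the class and method names are hypothetical, and the function body assumes the standard form of the Ackley function. Under that assumption, the best phenotype reported for the first run should give an f value close to 0.1001.

```java
// Illustrative sketch; names are hypothetical and not taken from AckleyProblem.java.
final class AckleySketch {
    static final double A = -32.768; // lower bound of the search interval
    static final double B = 32.768;  // upper bound of the search interval

    /** Decodes one bit string: X = N * (B - A) / (2^L - 1) + A. Assumes L <= 62. */
    static double decode(String bits) {
        long n = Long.parseLong(bits, 2);     // N: the bits read as a natural binary number
        long max = (1L << bits.length()) - 1; // 2^L - 1
        return n * (B - A) / max + A;
    }

    /** Splits the genome into `count` equal chunks and decodes each chunk. */
    static double[] phenotype(String genome, int count) {
        int len = genome.length() / count;
        double[] x = new double[count];
        for (int i = 0; i < count; i++) {
            x[i] = decode(genome.substring(i * len, (i + 1) * len));
        }
        return x;
    }

    /** Ackley function in its standard form (an assumption of this sketch). */
    static double ackley(double[] x) {
        double sumSquares = 0.0;
        double sumCos = 0.0;
        for (double xi : x) {
            sumSquares += xi * xi;
            sumCos += Math.cos(2 * Math.PI * xi);
        }
        int n = x.length;
        return -20 * Math.exp(-0.2 * Math.sqrt(sumSquares / n))
                - Math.exp(sumCos / n) + 20 + Math.E;
    }

    /** Fitness = 20 + e - F(x): higher is better. */
    static double fitness(double[] x) {
        return 20 + Math.E - ackley(x);
    }

    public static void main(String[] args) {
        // A 48-bit genome: three 16-bit chunks (all zeros, all ones, and N = 32754).
        String genome = "0000000000000000" + "1111111111111111" + "0111111111110010";
        double[] x = phenotype(genome, 3);
        System.out.println(java.util.Arrays.toString(x));
        System.out.println("f value: " + ackley(x) + ", fitness: " + fitness(x));
    }
}
```

Keeping the decoding separate from the function evaluation makes it easy to change the representation length without touching the fitness code.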
Pc was set to 1: experiments with different values showed that cross-over is beneficial, since it increases diversity with little damaging effect, so that a lower rate of mutation, which is more damaging, can be used. A larger population does bring slightly better results (when the rest of the parameters are the same), but it increases run time, and that run time is better spent on more generations, so a population of 100 was sufficient.

Mutation rates were changed from one run to another. Higher mutation rates increase the chance of a "lucky" best fitness but reduce the average fitness, so the gap between average fitness and best fitness grows. I therefore used a higher mutation rate for the runs with fewer generations, where "lucky" guesses gave better results, but no higher than 0.01, since anything above that was destructive. For the runs with more generations the average fitness seems to be more important, as the quality of the population builds up over time toward better fitness.

Run 1

Representation length: 16
Pm (probability for mutation): 0.01
Pc (probability for cross-over): 1.0
Population size: 100
Number of generations: 100

[Plot: average fitness and best fitness by generation; x axis: generations 1 to 100, y axis: fitness 0 to 25]

Best phenotype: [-0.013500205996798798, -0.015500236514839116, 0.027500419623102346]
Fitness: 22.61818603229005
Average fitness: 16.801908295479414
f value: 0.10009579616899389

Both average fitness and best fitness rise throughout the graph, which is a good indication that better values will be found with a higher number of generations, so the other parameters are not changed in the next run.
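For reference, the roles of Pm and Pc discussed in this section can be sketched on the "0"/"1" String genome as follows. This is a minimal sketch under assumptions: the report does not state which cross-over variant Question 1 uses, so one-point cross-over is assumed here, and all names are hypothetical.

```java
import java.util.Random;

// Illustrative sketch; one-point cross-over is an assumption, names are hypothetical.
final class OperatorSketch {
    /** Flips each bit independently with probability pm. */
    static String mutate(String genome, double pm, Random rng) {
        StringBuilder sb = new StringBuilder(genome);
        for (int i = 0; i < sb.length(); i++) {
            if (rng.nextDouble() < pm) {
                sb.setCharAt(i, sb.charAt(i) == '0' ? '1' : '0');
            }
        }
        return sb.toString();
    }

    /**
     * With probability pc, cuts both parents at the same random point and swaps
     * their tails; otherwise returns the parents unchanged.
     * Assumes both parents have the same length, which is at least 2.
     */
    static String[] crossover(String a, String b, double pc, Random rng) {
        if (rng.nextDouble() >= pc) {
            return new String[] {a, b};
        }
        int cut = 1 + rng.nextInt(a.length() - 1); // cut point in 1..length-1
        return new String[] {
            a.substring(0, cut) + b.substring(cut),
            b.substring(0, cut) + a.substring(cut)
        };
    }

    public static void main(String[] args) {
        Random rng = new Random();
        String[] children = crossover("0000000000", "1111111111", 1.0, rng);
        System.out.println(mutate(children[0], 0.01, rng));
        System.out.println(mutate(children[1], 0.01, rng));
    }
}
```

With Pm applied per bit, a genome of 48 bits at Pm = 0.01 receives about one flipped bit every two offspring on average, which gives a feel for why higher rates were destructive.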
Run 2

Representation length: 16
Pm (probability for mutation): 0.01
Pc (probability for cross-over): 1.0
Population size: 100
Number of generations: 500

[Plot: average fitness and best fitness by generation; x axis: generations 1 to 500, y axis: fitness 0 to 25]

Best phenotype: [0.003500053406575887, -0.008500129701687342, 0.008500129701687342]
Fitness: 22.686587031235483
Average fitness: 17.10398251808778
f value: 0.031694797223562166

The results are better than before, so the representation length is increased (to 20) to allow better accuracy.

Run 3

Representation length: 20
Pm (probability for mutation): 0.01
Pc (probability for cross-over): 1.0
Population size: 100
Number of generations: 2000

[Plot: average fitness and best fitness by generation; x axis: generations 1 to 2000, y axis: fitness 0 to 25]

Best phenotype: [0.00246875235438182, -0.002718752592805629, 0.00259375247360083]
Fitness: 22.707539978982076
Average fitness: 16.66716814436697
f value: 0.010741849476968657

The best fitness still improves, but the average fitness declines. The effect of mutation seems to be devastating, so Pm is reduced to 0.001.

Run 4

Representation length: 20
Pm (probability for mutation): 0.0010
Pc (probability for cross-over): 1.0
Population size: 100
Number of generations: 5000

[Plot: average fitness and best fitness by generation; x axis: generations 1 to 5000, y axis: fitness 0 to 25]

Best phenotype: [0.004718754500160571, 0.0017812516987376625, 4.06250387428031E-4]
Fitness: 22.70614157387505
Average fitness: 22.037964018920682
f value: 0.012140254583993926

With the lower mutation rate the average fitness is very good and very close to the best fitness. The graph is flat: after the improvements in the beginning it does not improve further, even over a high number of generations. Further improvements may or may not be possible. But since there is a slight improvement, the representation length is increased again (to 21).
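The effect of the representation length on accuracy can be made concrete. With L bits per coordinate the spacing between neighbouring representable values is (B - A) / (2 ^ L - 1), and because 2 ^ L - 1 is odd, 0 itself is not representable: the closest values are half a step away on either side. The sketch below, with hypothetical names, prints both numbers for the lengths used in the runs; the step is roughly 0.001 for 16 bits, 6.25E-5 for 20 bits and 3.125E-5 for 21 bits.

```java
// Illustrative sketch; names are hypothetical.
final class ResolutionSketch {
    static final double A = -32.768;
    static final double B = 32.768;

    /** Distance between two neighbouring representable values for `bits` bits. */
    static double step(int bits) {
        return (B - A) / ((1L << bits) - 1);
    }

    public static void main(String[] args) {
        for (int bits : new int[] {16, 20, 21}) {
            System.out.printf("L = %d: step = %.6e, closest value to 0 = +/- %.6e%n",
                    bits, step(bits), step(bits) / 2);
        }
    }
}
```

This is consistent with the reported results: the Run 3 coordinate 0.00246875235438182, for example, is 39.5 steps of the 20-bit grid away from 0.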
Run 5

Representation length: 21
Pm (probability for mutation): 0.0010
Pc (probability for cross-over): 1.0
Population size: 100
Number of generations: 10000

[Plot: average fitness and best fitness by generation; x axis: generations 1 to 10000, y axis: fitness 0 to 25]

Best phenotype: [-0.001328125633293098, -7.65625365083622E-4, 0.0025781262293520513]
Fitness: 22.711195137059793
Average fitness: 22.059355966696174
f value: 0.007086691399252221

The graph continues to be stable. This is the best phenotype found so far, and it is very close to the solution, which is [0, 0, 0]. That is a very good result for a stochastic algorithm.

Diversity was never a problem here (and it is not much needed for this specific problem). The diversity graphs of the runs look like this (this is the graph of the last generation of the last run):

[Plot: number of phenotypes by fitness; x axis: fitness 0 to 25, y axis: count 0 to 2.5]

The x axis represents the fitness, and the y axis represents the number of phenotypes with this fitness. There are almost no phenotypes with the same fitness, though there are many with close fitness. A different fitness means, with high probability, a different phenotype (and a different genotype, since the mapping between them is 1:1).

Source code: AckleyProblem.java

Question 2

The following may help to get unstuck:
- Increasing the mutation rate.
- Increasing the cross-over rate / more diversity with cross-over.
- Just waiting for more generations, which might bring a "jump" to the next fitness level.
- Using a different selection technique. For example, fitness-proportionate selection is known to cause premature convergence of the population and to eliminate diversity. There are some ways to counter that (not learned yet).
- Increasing the population size. Too small a population has limited diversity and can cause the search to get stuck at a local maximum.
- Occasionally introducing new (possibly random) phenotypes into the population.
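The last suggestion in the list above, introducing random individuals from time to time, could be sketched as follows. This is an illustration only: replacing the least-fit individuals is one possible policy chosen for the sketch, and the names and the parallel-array layout are assumptions.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Illustrative sketch; the replacement policy and all names are assumptions.
final class ImmigrantSketch {
    /** Builds a uniformly random bit string of the given length. */
    static String randomGenome(int length, Random rng) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(rng.nextBoolean() ? '1' : '0');
        }
        return sb.toString();
    }

    /**
     * Replaces the `count` least-fit genomes with random ones.
     * `population` and `fitness` are parallel arrays; count <= population.length.
     * Returns the replaced indices so the caller can re-evaluate their fitness.
     */
    static int[] injectImmigrants(String[] population, double[] fitness, int count, Random rng) {
        Integer[] order = new Integer[population.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort indices by ascending fitness, so the worst individuals come first.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> fitness[i]));
        int[] replaced = new int[count];
        for (int k = 0; k < count; k++) {
            int idx = order[k];
            population[idx] = randomGenome(population[idx].length(), rng);
            replaced[k] = idx;
        }
        return replaced;
    }

    public static void main(String[] args) {
        String[] population = {"0000", "1111", "0101"};
        double[] fitness = {3.0, 1.0, 2.0};
        int[] replaced = injectImmigrants(population, fitness, 1, new Random());
        System.out.println("replaced index " + replaced[0] + ": " + Arrays.toString(population));
    }
}
```

Calling such a step every few hundred generations, rather than every generation, would keep the disruption to the average fitness small.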
Question 3

Problem Description:

Implement a genetic algorithm to solve the Maximum Clique problem, that is, finding the largest (by number of vertices) sub-graph which is a complete graph.

Process of Work:
- Genome representation: We chose to represent our genomes with an array of boolean values. Array[i] == true means that the i-th vertex of the source graph is included in the clique the genome represents.
- Fitness: The fitness is the number of flagged values in the genome. The flagged cells are the vertices that participate in the sub-graph. Since the genome always represents a clique, the fitness directly represents the clique size.
- Genetic Operators:
  o Selection: We chose to use fitness-proportionate selection with roulette-wheel sampling.
  o Cross-Over: We used a one-point cross-over with a cross-over rate of 0.7. We tried two cross-over variants. The first only tried to add vertices to the genome after the cross-over point (the purpose was to find a less destructive cross-over). The second removed the original vertices after the cross-over point in the child and then tried to add vertices to the genome from the other child after the cross-over point.
  o Mutation: We used mutation rates of 0.001, 0.01 and 0.1. The mutation flipped vertices out of the genome, and also tried to flip vertices into the genome when the result was still a legitimate clique.

Running Results:
- hamming6-2.clq: best result was 32.
clique=1,4,6,7,10,11,13,16,18,19,21,24,25,28,30,31,34,35,37,40,41,44,46,47,49,52,54,55,58,59,61,64
clique=1,4,6,7,10,11,13,16,18,19,21,24,25,28,30,31,34,35,37,40,41,44,46,47,49,52,54,55,58,59,61,64
clique=1,4,6,7,10,11,13,16,18,19,21,24,25,28,30,31,34,35,37,40,41,44,46,47,49,52,54,55,58,59,61,64
- c-fat500-1.clq: best result was 14.
Clique=11,12,91,92,171,172,251,252,331,332,411,412,491,492
Clique=12,13,92,93,172,173,252,253,332,333,412,413,492,493
Clique=12,13,92,93,172,173,252,253,332,333,412,413,492,493
- p_hat500-1.clq: best result was 9.
Clique=47,69,71,107,148,242,266,279,408
Clique=47,69,71,148,242,248,266,279,412
Clique=47,69,71,148,248,266,279,412,489

Conclusions:

All in all, most of the time the results were better the more we emphasized exploration over exploitation. The most obvious example is that in both the first and second graphs better results were achieved with a mutation rate of 0.01, and in the third with a mutation rate of 0.1. Furthermore, the better cross-over was the more destructive one, which also tends to be more explorative than the other cross-over we used. The most apparent reason for this is that closeness of positions in the genotype string has no counterpart in the phenotype itself: the vertices that participate in a clique do not have to be grouped together in their index order.

Usage (needs JFreeChart): javaw -jar ecal_1_4.jar <clq file> <mutation probability> <cross-over probability>
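The boolean-array genome, the fitness, and the rule of adding a vertex only when the result is still a legitimate clique (Question 3) could look roughly like the sketch below. It is not the code of ecal_1_4.jar: the adjacency-matrix input, the 0-based vertex indices and all names are assumptions made for the illustration.

```java
// Illustrative sketch; adjacency-matrix input and all names are assumptions.
final class CliqueSketch {
    /** Fitness: the number of vertices flagged in the genome. */
    static int fitness(boolean[] genome) {
        int count = 0;
        for (boolean flagged : genome) {
            if (flagged) {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns true when vertex v is adjacent to every vertex already in the
     * genome, so that adding v keeps the genome a clique.
     * adjacency[u][v] is true when u and v share an edge (0-based indices).
     */
    static boolean canAdd(boolean[][] adjacency, boolean[] genome, int v) {
        for (int u = 0; u < genome.length; u++) {
            if (genome[u] && u != v && !adjacency[u][v]) {
                return false;
            }
        }
        return true;
    }

    /** Flips vertex v into the genome only if the result is still a clique. */
    static boolean tryAdd(boolean[][] adjacency, boolean[] genome, int v) {
        if (!genome[v] && canAdd(adjacency, genome, v)) {
            genome[v] = true;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Triangle 0-1-2 plus vertex 3, which is connected to vertex 0 only.
        boolean[][] adjacency = new boolean[4][4];
        int[][] edges = {{0, 1}, {0, 2}, {1, 2}, {0, 3}};
        for (int[] e : edges) {
            adjacency[e[0]][e[1]] = true;
            adjacency[e[1]][e[0]] = true;
        }
        boolean[] genome = {true, true, false, false};
        System.out.println("added 2: " + tryAdd(adjacency, genome, 2));
        System.out.println("added 3: " + tryAdd(adjacency, genome, 3));
        System.out.println("clique size: " + fitness(genome));
    }
}
```

Because every operator goes through a check like canAdd, the genome never leaves the set of legitimate cliques, which is what allows the fitness to be a plain count.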