Genetic Algorithms
Jacob Noyes
[email protected]
Department of Computer Science
University of Wisconsin – Platteville
4/19/2013
Abstract
Genetic algorithms are specialized search heuristics which use the fundamental principles of evolution
through natural selection to find the best solution to a problem. The process populates a group of
random potential solutions to a given problem, selects a specific subset of optimized solutions from the
initial population, combines the traits of the selected solutions, injects mutations, and ends with a new
generation of potential solutions. This process is then repeated until a certain threshold is met. This
paper goes over the basic processes which make up a genetic algorithm, their many variations, how to
use them, and how they relate to principles found in the natural world.
Introduction
Genetic algorithms are a robust, wide-ranging, and simple-to-understand answer to large, complex
search spaces. Many search algorithms may match or beat genetic algorithms in efficiency when it is
plausible to inspect every possible outcome, but genetic algorithms take over when the possible
combinations of parameters exceed the testing capacity. Genetic algorithms can then be used to find
the best fit for the situation using a system designed to mimic the process of evolution through natural
selection.
Genetic algorithms are designed from the ground up with evolution in mind. The process is broken
down into four stages: initialization, selection, crossover, and mutation. Initialization begins the
process by generating a population of random parent guesses. Selection determines how fit each parent
guess is in relation to the search criteria. Crossover takes parent guesses based on how fit they are and
creates offspring guesses which share traits from each parent guess. Mutation injects random
differences into a small percentage of the offspring to avoid homogeneity. Initialization only happens
for the first generation. Selection, crossover, and mutation then repeat until either a certain number of
generations have occurred or a specified fitness level has been obtained.
Example 1: Y optimization
y = -x^2 + 255x; where x is in the range 0 ≤ x ≤ 255.
An easily understood example of a problem that can be answered through genetic algorithms is
determining the highest integer value of y in Example 1. This problem can easily be solved
through the use of calculus, plotting the line, or even simple guess-and-check, but it will help illustrate
how genetic algorithms work. A unique way to encode the problem must be created before this
process can begin [2].
Encoding
A changeable representation of a guess's traits is decided during encoding. This representation, called
its “string”, can be displayed as a series of bits. A problem with a single parameter may use its binary
representation as its string. Example 1 will use the binary representation of x as its string. The range
of this problem is 0 to 255 so each guess will have an 8-bit string between 00000000 and 11111111.
Example 2: Multiple parameter problem
y = 2x^2 + z^2 + w; where x, z, and w are in the range 0 ≤ x/z/w ≤ 7.
Problems with multiple parameters may use a concatenation of all binary representations of traits as
their strings. Example 2 contains three 3-bit traits. The three traits can be concatenated to create one 9-bit string. This creates a single, manageable entity that can be used in the crossover stage [2].
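As an illustration of this encoding scheme, the following minimal Python sketch converts the parameters of Examples 1 and 2 into bit strings. The function names are illustrative and not part of any standard library.

```python
def encode_example1(x: int) -> str:
    """Encode x (0-255) from Example 1 as an 8-bit binary string."""
    return format(x, "08b")


def encode_example2(x: int, z: int, w: int) -> str:
    """Concatenate the three 3-bit traits (each 0-7) of Example 2 into one 9-bit string."""
    return format(x, "03b") + format(z, "03b") + format(w, "03b")


def decode_example1(string: str) -> int:
    """Recover the integer value of x from its 8-bit string."""
    return int(string, 2)


print(encode_example1(125))          # '01111101'
print(encode_example2(5, 2, 7))      # '101010111'
print(decode_example1("01111101"))   # 125
```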
Initialization
Now that an encoding procedure has been decided on, the initialization process can take place.
Initialization is the process of generating an initial population of parents that will seed the second
generation. It is performed only at the beginning of the algorithm and not repeated. A good initial
selection of “genes”, or their binary representation, is needed to inject variety at the start of the process;
therefore, strings are generated at random.
Table 1: Example 1 data
Column 1   Column 2    Column 3   Column 4    Column 5   Column 6
x          Binary x    Fitness    Gene pool   Locus      Offspring
188        10111100    12596      01111101    3          01100110
48         00110000     9936      10000110    1          10011101
75         01001011    13500      01101000    1          11101000
104        01101000    15704      01001011    6          00000110
249        11111001     1494                             01001011
10         00001010     2450                             01101000
134        10000110    16214                             01001001
125        01111101    16250                             01111111
Continuing example 1, we will use a population of eight random integers in the range of 0 ≤ x ≤ 255.
A random number generator was used to create the initial numbers, shown in Column 1. Column 2
shows the binary representation of each number. The binary representation is what will be used in the
process of the genetic algorithm [2].
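A minimal Python sketch of this initialization step might look like the following; the population size of eight matches the example, and the names are illustrative.

```python
import random


def initialize_population(size: int = 8) -> list[str]:
    """Generate `size` random 8-bit strings, one per initial guess of x in 0-255."""
    return [format(random.randint(0, 255), "08b") for _ in range(size)]


population = initialize_population()
print(population)  # e.g. ['10111100', '00110000', ...] -- Columns 1 and 2 of Table 1 in one step
```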
Selection
The next step in the process is selection. Selection deals with assigning a fitness score to each
potential solution. The level of fitness is determined by how well each solution fulfills the intended
outcome of the search. There has to be a benchmark to determine the strength of the solutions. The
level of fitness decides which solutions will go on to have offspring and which ones will “die off”.
This determines which genes get passed on to new generations.
Example 1 will use each parent's y-value as its fitness score. So to figure out how fit an individual is
we need only insert the x-values from Column 1 into the equation in Example 1 and solve for each y-value. The fitness factor has been calculated using this method and is represented in Column 3. As can
be seen, there is a disparity between some of the values. Higher fitness factors will be picked more
often when reproducing, thus passing on the stronger solutions [2].
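A sketch of this fitness calculation for Example 1 (the function name is illustrative):

```python
def fitness(string: str) -> int:
    """Fitness for Example 1: decode the 8-bit string to x and return y = -x^2 + 255x."""
    x = int(string, 2)
    return -x * x + 255 * x


# Reproducing the first row of Table 1: x = 188 (binary 10111100) has a fitness of 12596.
print(fitness("10111100"))  # 12596
```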
There are many different ways to pick which solutions will procreate, and they all involve the fitness
factor. These methods can be classified into the following categories: proportionate selection, ranking
selection, tournament selection, gender-specific selection, genetic relatedness based selection, and
elitism. Each method has its advantages and disadvantages.
Proportionate Selection Methods
Proportionate selection utilizes each individual's fitness in relation to the fitness of other individuals to
determine which individuals will be selected for reproduction. Within proportionate selection there are
the following methods: roulette wheel selection, deterministic sampling, stochastic remainder sampling
with replacement, stochastic remainder sampling without replacement, and stochastic universal
selection.
Roulette wheel selection pits each individual's fitness against the cumulative fitness of the whole
population. First the total fitness factor of the population, Pf, is found by adding up every fitness factor
for the given generation. Then a probability of selection, psel, is calculated for each individual by
dividing its fitness level by the total fitness factor. Each psel is then inserted into an array. A random
number is generated between 0 and the total fitness factor. Starting at the beginning of the array each
element is added up until this sum is greater than the random number. The last element of the array to
be added to the sum is then selected for breeding.
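A minimal Python sketch of roulette wheel selection as just described (the helper names are illustrative):

```python
import random


def roulette_wheel_select(population: list[str], fitnesses: list[int]) -> str:
    """Select one individual with probability proportional to its fitness."""
    total_fitness = sum(fitnesses)                           # Pf
    probabilities = [f / total_fitness for f in fitnesses]   # psel for each individual
    spin = random.random()                                   # random number between 0 and 1
    running_sum = 0.0
    for individual, p_sel in zip(population, probabilities):
        running_sum += p_sel
        if running_sum >= spin:                              # stop once the sum passes the spin
            return individual
    return population[-1]                                    # guard against floating-point round-off
```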
This approach has the advantage of still allowing lower fitness level candidates to reproduce, just at a
lower rate, which helps keep a population's genetic diversity. Other methods may have higher
pressures which simply annihilate lower-performing individuals. Methods that have such a high selective
pressure that they completely remove diversity too quickly can have problems getting stuck at a local
optimum. A local optimum is a point at which small changes will not improve the fitness of
individuals, but there may be distant genes that lead to higher fitness levels. This is different from the
overall optimum, which is the highest fitness level achievable. The problem comes in when there are
several local optimums. Removing the diversity leaves the generations stuck in one specific local
optimum, only passing on very similar genes.
Deterministic sampling reworks the mating population. The average fitness is determined by summing
all of the fitness levels together and dividing by the total number of individuals in the population. Then
each individual's fitness level is divided by the average fitness of the group. The whole number that is
produced is the number of spots in the mating pool that the individual will occupy. If there are any
spots left in the mating pool after each individual's spots have been computed, then the decimal values
of each individual's computation are sorted and the rest of the spots are filled starting with the highest
value.
After the mating population has been determined, random numbers are generated which point to the
index of each selected individual. Since this is a comparison between the individual's fitness level and
the average fitness level, it gives each individual a proportionate number of spots in the gene pool and
lets fit members procreate more often.
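A possible Python sketch of deterministic sampling, assuming the mating pool is the same size as the population (an assumption not stated explicitly above):

```python
import random


def deterministic_sampling(population: list[str], fitnesses: list[int]) -> list[str]:
    """Build a mating pool using whole-number fitness-to-average-fitness quotas."""
    average_fitness = sum(fitnesses) / len(fitnesses)
    ratios = [f / average_fitness for f in fitnesses]          # fitness-to-average-fitness
    mating_pool = []
    for individual, ratio in zip(population, ratios):
        mating_pool.extend([individual] * int(ratio))          # whole-number share of spots
    # Fill any remaining spots from the largest fractional remainders.
    by_remainder = sorted(zip(population, ratios), key=lambda pair: pair[1] % 1, reverse=True)
    for individual, _ in by_remainder:
        if len(mating_pool) >= len(population):
            break
        mating_pool.append(individual)
    return mating_pool
```

Individuals would then be drawn from the finished pool at random, for example with random.choice(mating_pool).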
Stochastic remainder sampling with replacement is a combination of deterministic sampling and
roulette wheel selection. It begins by using the deterministic sampling method to fill spots in the
potential gene pool with whole number fitness-to-average-fitness members. It, however, treats the left
over spots differently. It uses the remainder of each individual's fitness-to-average-fitness ratio to fill
an array using the methods of roulette wheel selection. The left over spots in the gene pool are then
filled from the remainder array using the same method applied in roulette wheel selection.
Stochastic remainder sampling without replacement uses the same initial method to fill spots in the
gene pool that deterministic sampling and stochastic remainder sampling with replacement use, but has
yet a different way of filling left over spots. It takes the remainders from the fitness-to-average-fitness
ratio and steps through each one performing a weighted-coin toss to determine if it will be selected.
Each remainder is turned into a percentage by multiplying it by 100. A random number is generated
and if the number is lower than the remainder's percentage, its genes are selected for the next open spot.
The iteration continues until all spots are filled [6].
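A sketch of stochastic remainder sampling without replacement, under the same pool-size assumption as the deterministic sampling sketch above:

```python
import random


def stochastic_remainder_no_replacement(population: list[str], fitnesses: list[int]) -> list[str]:
    """Fill whole-number quotas first, then weighted coin tosses on the remainders."""
    average_fitness = sum(fitnesses) / len(fitnesses)
    ratios = [f / average_fitness for f in fitnesses]
    mating_pool = []
    for individual, ratio in zip(population, ratios):
        mating_pool.extend([individual] * int(ratio))            # whole-number spots first
    while len(mating_pool) < len(population):
        for individual, ratio in zip(population, ratios):
            if len(mating_pool) >= len(population):
                break
            remainder_percent = (ratio % 1) * 100                # remainder as a percentage
            if random.uniform(0, 100) < remainder_percent:       # weighted coin toss
                mating_pool.append(individual)
    return mating_pool
```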
Ranking Selection
Ranking selection methods use an approach which gives individuals a chance to breed based on order
with no regard to proportion. They are generally easier to implement and understand, but come at a
cost of possibly being less accurate or less efficient. They tend to phase out diversity too quickly. Due
to their problems, they are generally not used other than for instructional purposes. Ranking selection
contains both linear ranking selection and truncate selection [6].
Linear ranking selection starts by sorting all candidates in order from highest to lowest fitness. A rank
is then given to each candidate. Predefined selection probabilities are determined based on the number
of individuals in the population without respect to individual fitness levels. The probabilities are then
assigned to each candidate based on rank, with the highest rank getting the highest probability all the
way down to the lowest rank getting the lowest probability. Then the individuals are selected for
breeding based on their assigned probabilities [6].
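A sketch of linear ranking selection; the linearly decreasing weights are one reasonable choice of predefined probabilities, since the description above only requires that they depend on population size and rank:

```python
import random


def linear_ranking_select(population: list[str], fitnesses: list[int]) -> str:
    """Select one individual using rank-based probabilities rather than raw fitness."""
    ranked = [ind for _, ind in sorted(zip(fitnesses, population), reverse=True)]
    n = len(ranked)
    weights = list(range(n, 0, -1))                     # rank 1 gets weight n, last rank gets 1
    probabilities = [w / sum(weights) for w in weights]
    return random.choices(ranked, weights=probabilities, k=1)[0]
```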
Truncate selection uses a similar ranking system. Candidates are sorted in order from highest to lowest
fitness. From here a predefined percentage of candidates will be chosen for reproduction. A common
example would be choosing the top half of a group and reproducing each one with two other
individuals. This will fill out a new generation, but will lead to a rapid decline in genetic diversity as
half of the gene pool is dying off every generation [6].
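A sketch of truncate selection; with a fraction of 0.5 and the Table 1 fitness scores it returns the four parents shown in Column 4:

```python
def truncate_select(population: list[str], fitnesses: list[int], fraction: float = 0.5) -> list[str]:
    """Keep the top `fraction` of the population, ranked by fitness, as the mating pool."""
    ranked = [ind for _, ind in sorted(zip(fitnesses, population), reverse=True)]
    cutoff = max(1, int(len(ranked) * fraction))
    return ranked[:cutoff]
```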
Tournament Selection
Tournament selection is characterized by pitting individuals against each other in sometimes randomly
selected brackets to determine which ones get to reproduce. There are no probabilities assigned. It is
all determined simply by whether the individual is selected and whether it is more fit than its opponents.
One of the benefits of tournament selection is that it only needs local information. It only needs to
know the fitness of the selected individuals. It doesn't need to find a total sum or average fitness factor
of all possible candidates. This makes it a good candidate for any problem where it might be
implausible or inefficient to calculate totals. The categories of tournament selection include: binary
tournament selection, larger tournament selection, Boltzmann tournament selection, and correlative
tournament selection [6].
Binary tournament selection is the simplest version of tournament selection. Two candidates are
randomly selected out of the pool using a random number generator. Between the two candidates,
whichever one has the highest fitness level is chosen to reproduce. This method is fast, efficient, and
easy to implement [6].
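A sketch of binary tournament selection:

```python
import random


def binary_tournament_select(population: list[str], fitnesses: list[int]) -> str:
    """Pick two candidates at random and return the fitter of the two."""
    i, j = random.sample(range(len(population)), 2)
    return population[i] if fitnesses[i] >= fitnesses[j] else population[j]
```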
Larger tournament selection uses the same methodology as binary tournament selection except more
than two individuals compete against each other. There is no difference other than the number of
competitors [6].
Boltzmann tournament selection allows the selective pressure to be changed. It uses a variable, which
can be changed over time, to increase or decrease the effect that the fitness factor has on the selection.
When the variable, called its “temperature”, decreases it forces the selection method to pick more fit
candidates, and when it increases it allows more of a random selection process. This adds a great deal
of range to the application of the method.
Correlative tournament selection is an extension of the regular binary and larger tournament selection
methods. It is not so much a different selection method as a gene pool pairing mechanism.
Once the pool of parents has been computed, this process pairs the parents based on how closely they
are related. This fleshes out the possible advantages that the two parents may already have in common
[6].
Gender-specific selection
Gender-specific selection copies the foundation of evolution through sexual selection rather than
natural selection. Sexual selection occurs in nature when the ability to mate for one sex is determined
by the desires of the other sex. Gender-specific selection deals more with differences in two selected
parents rather than each parent's ability to survive. It allows for two different approaches for selection
to be utilized at the same time. Gender-specific selection methods include: genetic algorithm with
chromosome differentiation, restricted mating, and correlative family-based selection [6].
Genetic algorithms with chromosome differentiation (GACD) have a wholly different approach to
selection. One reason sexual selection is such a powerful tool is that it forces some genetic variation
within populations. Populations without two sets of genes going into their offspring tend to become
homogeneous very quickly. So GACDs differentiate between a “male” class and a “female” class [1]
[6].
To use this method, during encoding an extra two bits are attached to each individual's string called the
class bits. These bits can have the value of either 00, representing a female, or 01, representing a male.
In this way every candidate can be classified as either a male or female. An example of a string used in
GACD might be 0111001010, where the initial 01 is the group of class bits and the following
11001010 is the group of data bits which make up the standard trait representation [1].
The males, which make up half of the population to begin with, are generated randomly as usual. The
females are then created by maximizing the Hamming distance between themselves and their corresponding males. The Hamming distance is the sum of the differences between each bit of the given male and the female. For example, given the male string of 0111001010, the female with the highest Hamming distance would be
0000110101. In this way, GACD begins its initialization by creating two opposing groups of males
and females which have nothing in common [1].
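A sketch of this GACD initialization step; as in the example above, the female that maximizes the Hamming distance over the data bits is the bit-wise complement of the male's data bits, prefixed with the female class bits 00:

```python
import random


def random_male(data_bits: int = 8) -> str:
    """A male string: class bits '01' followed by random data bits."""
    return "01" + "".join(random.choice("01") for _ in range(data_bits))


def opposing_female(male: str) -> str:
    """The female maximizing the Hamming distance over the data bits: class bits '00'
    followed by the complement of the male's data bits."""
    complemented = "".join("1" if bit == "0" else "0" for bit in male[2:])
    return "00" + complemented


print(opposing_female("0111001010"))  # '0000110101', matching the example above
```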
The pool of potential parents can then either be selected by applying two different selection mechanics
to the two classes or just by lumping them together and using one selection mechanic. These can be
decided on a case by case basis. Once the pool of parents is created, the males and females are mated
twice and removed from the pool until the pool either runs out of males or females or both. If the pool
runs out of either males or females, but not both, then the leftover males or females are mated with the
most fit individual of the opposite sex [1] [6].
The data string of each child is produced through the normal processes in crossover. The sex of the
child, however, is selected by mimicking the natural process of sex selection. One of the class bits
from each parent is taken and added to the child to create its new class bits. The female parent can only
contribute a zero because both of her class bits are zero. The male parent then either contributes its
first class bit to create a new female or its second class bit to create a new male with class bits of 01.
The class bits are both selected at random. Since each couple produces two offspring and the selected
class bits are replaced when used, a couple may produce two males, two females or one male and one
female [6].
Restricted mating uses the concept of the classification of species in its selection technique. In the
natural world, different species are defined as animals that cannot or do not mate with each other. An
example would be the household dog. Dogs may vary in appearance and temperament quite a bit, but
any two dogs should be able to produce viable offspring. Because of this, all dogs are a part of the same
genus and species, Canis familiaris [7].
In the same way, restricted mating classifies each individual into a specific class. The class is
determined by a certain predefined trait, or subset of the binary representation of the individual. The
different classes are then only allowed to mate within their own “species”. For example, if the first
three bits of a 12-bit string are designated the species trait, then a given individual can only mate with
another individual whose first three bits match their own. A candidate with a string of 100110011101
may mate with 100000101100, but not 000000101100. In this way, the algorithm is able to keep
several separate variations evolving at the same time without worrying about them mixing into one
local optimum.
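A sketch of the restricted mating compatibility check from the example above (the function name is illustrative):

```python
def same_species(a: str, b: str, species_bits: int = 3) -> bool:
    """Two individuals may mate only if their designated species bits match."""
    return a[:species_bits] == b[:species_bits]


print(same_species("100110011101", "100000101100"))  # True:  both start with 100
print(same_species("100110011101", "000000101100"))  # False: 100 versus 000
```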
Correlative family-based selection is used to maintain population diversity. Two parents are mated
together twice producing two offspring. The two parents and the two offspring are considered the
family. Firstly, among the family, the individual with the highest fitness level is chosen to go on to the
next generation. This makes it so that if one of the parents has a better fitness level, then they are
passed on instead of an inferior offspring. Secondly, the Hamming distance is calculated between each member of the family and the other three members. The individual with the greatest Hamming distance is
also passed on to the next generation. This allows both the most fit member and the member with the
greatest diversity compared to the average to be passed on [6].
Genetic Relatedness Based Selection
Genetic relatedness based selection methods are built around the idea of each individual remembering how it relates to other individuals. An individual does this by recognizing the closest ancestor it has in common with each other individual, and it then uses this information to interact with its closest living relatives. Genetic relatedness based selections include: fitness uniform selection scheme and reserve selection.
Fitness uniform selection scheme (FUSS) is based on the principle of searching areas of the search space that haven't been checked as much as others. FUSS is not concerned with reproducing the
candidates with the highest fitness values; it is concerned with reproducing candidates that have little
exploration done around their search space [3] [4].
The fitness values of each candidate are still calculated and then they are lumped into fitness levels.
This organization puts similar individuals into their own groups based on fitness level. Then a random
number is generated between the lowest and the highest fitness value. From here, the candidate with
the closest fitness value to the random number is chosen to reproduce. This process tends to favor candidates whose fitness values lie farthest from those of the other candidates: when candidates are bunched closely together, the gaps between them are small, so the range of random numbers that would select any one of them is also small. Each chosen candidate then gets put into the gene pool and reproduces
[3] [4].
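A minimal sketch of the FUSS mechanism described above:

```python
import random


def fuss_select(population: list[str], fitnesses: list[int]) -> str:
    """Draw a target fitness uniformly between the lowest and highest fitness in the
    population, then select the candidate whose fitness is closest to that target."""
    target = random.uniform(min(fitnesses), max(fitnesses))
    closest = min(range(len(population)), key=lambda i: abs(fitnesses[i] - target))
    return population[closest]
```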
This method helps to avoid getting stuck in a local optimum because the individuals have a tendency to
diverge instead of converge. It constantly works to increase diversity and reach new areas of the search
space.
Reserve selection also aims to avoid getting stuck in local optimums. The reserve selection
method has two categories: reserved and non-reserved. Non-reserved candidates tend to have better
fitness scores and are subjected to the normal selection methods. Reserved candidates are generally
low fit individuals that are carried over to new generations, but can also be selected with replacement to
create offspring. Purposely keeping low fit individuals in the gene pool may seem counter-intuitive,
but the purpose is to keep diversity in the pool to avoid getting stuck in local optimums. While a
truncate selection removes low performers from the pool and can get stuck in a local optimum, a reserve selection will always have a way to get out [6].
Elitism
Elitism is not a total selection method in itself. It is an extension that can be attached to any other
selection method. The purpose of elitism is to make sure that the most fit candidate does not die off
through sheer bad luck in the selection process. Even if the individual with the highest fitness has the highest
chance of reproduction, it still may get unlucky and not make it into the gene pool. To bypass this,
when elitism is used the candidate with the highest fitness level is automatically carried over to the next
generation.
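A sketch of elitism as an add-on to any other selection method; the choice to overwrite the weakest offspring is an assumption made here, since the description above only states that the best candidate is carried over:

```python
def apply_elitism(old_population: list[str], old_fitnesses: list[int],
                  new_population: list[str], new_fitnesses: list[int]) -> list[str]:
    """Carry the fittest member of the previous generation into the next one unchanged."""
    best = old_population[old_fitnesses.index(max(old_fitnesses))]
    if best not in new_population:
        weakest = new_fitnesses.index(min(new_fitnesses))   # assumption: the elite individual
        new_population[weakest] = best                      # replaces the weakest offspring
    return new_population
```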
Automatically carrying the top candidate over to the next generation assures that progress will never be
lost by an unlucky generation. This is an advantage because it makes sure ground is never lost in the
search. This can also be a disadvantage, though, because it can make it much harder or even impossible for the process to escape a local optimum [2] [6].
Crossover
Crossover is the process where the combination of genes actually takes place. In nature, crossover
happens when it is determined which parent a child receives each gene from. In the genetic algorithm
usage, it is where one or more points, called loci, in the genetic string are determined and then used to
combine genes. The locus is randomly generated with a value between one and the length of the
genetic bit string minus one. The locus divides the string into two separate strings. The string to the
right is replaced by the string in the equivalent position from the other parent. This creates the genetic
string that makes up the offspring of the two [2].
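A sketch of single-point crossover; the usage line reproduces the first pairing worked through below (locus 3, parents 01111101 and 10000110):

```python
import random


def single_point_crossover(parent_a: str, parent_b: str, locus: int | None = None) -> tuple[str, str]:
    """Cut both parent strings at the same locus and swap the right-hand portions."""
    if locus is None:
        locus = random.randint(1, len(parent_a) - 1)   # between 1 and string length minus 1
    child_a = parent_a[:locus] + parent_b[locus:]
    child_b = parent_b[:locus] + parent_a[locus:]
    return child_a, child_b


print(single_point_crossover("01111101", "10000110", locus=3))  # ('01100110', '10011101')
```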
Using the equation in Example 1 and the data in Columns 1-3, a top half truncate selection will be run
to find the reproducing parents. The list of reproducing parents is given in Column 4. Given that a
truncate selection was used, each individual selected will reproduce twice, each time with a different partner. The first parent will reproduce with the second parent, the second parent will also reproduce
with the third parent, the third parent will also reproduce with the fourth parent, and the fourth parent will also reproduce with the first parent.
Generating a random number between one and seven (the number of bits in the string minus one) returns three. This means that for the first crossover to be made, the locus is at position three. Each parent's string is separated immediately following the third bit and crossed over with the opposite portion of the other parent's string. In this example, the first parent's bit string is divided into 011/11101 and
then crossed over with the second parent's divided bit string of 100/00110 to form the first offspring of
01100110 and the second offspring of 10011101. This process is then repeated with the different
combinations of parents to produce all of the offspring to fill the next population in Column 6 using the
randomly generated numbers for each crossover in Column 5.
Mutation
Mutation is the next step in the process. In the natural world, evolution wouldn't work without
mutation. Mutation has to happen to bring the initial diversity into a population. Without mutation the
first life form on Earth would not have evolved into anything else. Mutations sometimes arise from replication errors. If a mutation gives an advantage to the organism in some aspect, it may
have a greater chance of reproducing and passing on its new mutated genes.
Genetic algorithms purposely inject mutations into genes to artificially replicate the natural process.
Each bit that is created during crossover is then checked against a certain probability to see if it should
be mutated. Mutation probabilities are in general purposely very low. A common mutation rate may
be around 1/1000. This means every bit is checked with a random number generated against the
mutation rate to determine if the bit should be flipped. If the mutation check is successful, then the bit
is XORed with one.
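A sketch of bit-flip mutation using the 1/1000 rate mentioned above:

```python
import random

MUTATION_RATE = 1 / 1000  # a common, purposely low mutation rate


def mutate(string: str, rate: float = MUTATION_RATE) -> str:
    """Check every bit against the mutation rate and flip (XOR with one) those that hit."""
    bits = []
    for bit in string:
        if random.random() < rate:
            bits.append("1" if bit == "0" else "0")  # equivalent to XORing the bit with 1
        else:
            bits.append(bit)
    return "".join(bits)
```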
The mutation process, therefore, injects diversity into the population. The regular injection helps to
avoid retaining a homogeneous population. A homogenized population ruins the efficiency of the
algorithm so mutation is a necessary part.
After mutation has completed, the algorithm is left with a fully functioning new generation of
individuals. If the algorithm is fitness threshold based, the solutions are checked to see if any of them
have a fitness level over the given threshold. If one does, the algorithm will end, having found the
answer. If the algorithm is generation based, it will check to see if it has reached the specified number
of generations. If not, it will add one to the generation counter and start back at the selection phase for
the new generation [2].
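Putting the stages together, one possible end-to-end loop for Example 1 might look like the following self-contained sketch. It uses top-half truncate selection, single-point crossover, and bit-flip mutation, and it terminates on either a fitness threshold or a generation limit; the specific parameter values are illustrative.

```python
import random


def fitness(string: str) -> int:
    x = int(string, 2)
    return -x * x + 255 * x                              # y = -x^2 + 255x from Example 1


def run_ga(pop_size: int = 8, max_generations: int = 50,
           fitness_threshold: int = 16256,               # best integer y (x = 127 or 128)
           mutation_rate: float = 1 / 1000) -> str:
    """Initialize once, then repeat selection, crossover, and mutation until done."""
    population = [format(random.randint(0, 255), "08b") for _ in range(pop_size)]
    for _ in range(max_generations):
        scores = [fitness(individual) for individual in population]
        if max(scores) >= fitness_threshold:             # fitness-threshold termination
            break
        # Top-half truncate selection, as in the worked example.
        parents = [ind for _, ind in sorted(zip(scores, population), reverse=True)][:pop_size // 2]
        offspring = []
        for i, parent_a in enumerate(parents):           # pair each parent with the next one
            parent_b = parents[(i + 1) % len(parents)]
            locus = random.randint(1, 7)                 # single-point crossover locus
            offspring.append(parent_a[:locus] + parent_b[locus:])
            offspring.append(parent_b[:locus] + parent_a[locus:])
        # Bit-flip mutation: a bit is XORed with one when the mutation check succeeds.
        population = ["".join(bit if random.random() >= mutation_rate
                              else ("1" if bit == "0" else "0") for bit in individual)
                      for individual in offspring]
    scores = [fitness(individual) for individual in population]
    return population[scores.index(max(scores))]


print(run_ga())  # e.g. '01111111' (x = 127) or '10000000' (x = 128)
```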
Conclusion
Genetic algorithms have a wide variety of uses due to how flexible they are. Their ability to be
tweaked in different ways to allow for different situations is what makes them so powerful. At their
core they use the fundamental principles of the force that brought life to what it is today from single
cell organisms. This flexibility has seen genetic algorithms put to use in such wide-ranging
implementations as image processing, optimal water network layouts, facial recognition, robotics, and trajectories for spacecraft [2]. Genetic algorithms are an effective solution to large-scale optimization
problems.
References
[1] Bandyopadhyay, S., Pal, S. K., & Maulik, U. (1998). Incorporating chromosome differentiation in genetic algorithms. Information Sciences. Retrieved from http://www.isical.ac.in/~sanghami/bandyopa_ieeetgrs.pdf.gz
[2] Coley, D. (2003). An Introduction to Genetic Algorithms for Scientists and Engineers. Singapore: World Scientific Publishing Co. Pte. Ltd.
[3] Hutter, M. (2001). Fitness Uniform Selection to Preserve Genetic Diversity. Retrieved from http://arxiv.org/pdf/cs.AI/0103015.pdf
[4] Hutter, M. (2004). Tournament versus fitness uniform selection. Retrieved from http://arxiv.org/pdf/cs.LG/0403038.pdf
[5] Pragasen, P., Nolan, R., & Towhidul, H. (1997). Application of genetic algorithms to motor parameter determination for transient torque calculations. IEEE Transactions on Industry Applications, 33(5), 1275. Retrieved from http://spectrum.library.concordia.ca/6537/1/Pillay_application_genetic.pdf
[6] Sivaraj, R. (2011). A Review of Selection Methods in Genetic Algorithm. International Journal of Engineering Science and Technology, 5(3). Retrieved from http://www.ijest.info/docs/IJEST11-03-05-190.pdf
[7] Swaminathan, N. (n.d.). Why are different breeds of dogs all considered the same species? Retrieved from http://www.scientificamerican.com/article.cfm?id=different-dog-breeds-same-species