CS 478 - Machine Learning
Genetic Algorithms (I)
Darwin’s Origin of Species: Basic Principles (I)

- Individuals survive based on their ability to adapt to the pressures of their environment (i.e., their fitness)
- Fitter individuals tend to have more offspring, thus driving the population as a whole towards favorable traits
- During reproduction, the traits found in parents are passed on to their offspring
- In sexual reproduction, the chromosomes of the offspring are a mix of those of their parents
- The traits of offspring are partially inherited from their parents and partially the result of new genes/traits created during the process of reproduction
- Nature produces individuals with differing traits
- Over long periods, variations can accumulate, producing entirely new species whose traits make them especially suited to particular ecological niches
Darwin’s Origin of Species: Basic Principles (II)

- Evolution is effected via two main genetic mechanisms:
  - Crossover
    - Take 2 candidate chromosomes
    - “Randomly” choose 1 or 2 crossover points
    - Swap the respective components to create 2 new chromosomes
  - Mutation
    - Choose a single offspring
    - Randomly change some aspect of it
Intuition

- Essentially a pseudo-random walk through the population with the aim of maximizing some fitness function
- From a starting population:
  - Crossover ensures exploitation
  - Mutation ensures exploration
- GAs are based on these principles
Natural vs Artificial

  Natural                    Artificial
  Individual                 Candidate solution
  Population                 Set of candidate solutions
  Fitness                    Measure of quality of solutions
  Chromosome                 Encoding of candidate solutions
  Gene                       Part of the encoding of a solution
  Crossover and mutation     Search operators
  Natural selection          Re-use of good (sub)solutions
Phenotype vs. Genotype

- In Genetic Algorithms (GAs), there is a clear distinction between phenotype (i.e., the actual individual or solution) and genotype (i.e., the individual's encoding or chromosome). The GA, as in nature, acts on genotypes only. Hence, the natural process of growth must be implemented as a genotype-to-phenotype decoding.
- The original formulation of genetic algorithms relied on a binary encoding of solutions, where chromosomes are strings of 0s and 1s. Individuals can then be anything, so long as there is a way of encoding/decoding them using binary strings.
Simple GA

- One often distinguishes between two types of genetic algorithms, based on whether there is a complete or partial replacement of the population between generations (i.e., whether or not generations overlap).
- When there is complete replacement, the GA is said to be generational, whilst when replacement is only partial, the GA is said to be steady-state. If you look carefully at the algorithms below, you will notice that even the generational GA gives only partial replacement when cloning takes place (i.e., cloning causes overlap between generations). Moreover, if steady-state replacement is performed on the whole population (rather than on a proportion of fittest individuals), then the GA is generational.
- Hence, the distinction is more a matter of how reproduction takes place than a matter of overlap.
Generational GA

- Randomly generate a population of chromosomes
- While (termination condition not met):
  - Decode chromosomes into individuals
  - Evaluate fitness of all individuals
  - Select fittest individuals
  - Generate new population by cloning, crossover and mutation (see the sketch below)
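A minimal Python sketch of this generational loop, assuming binary chromosomes, fitness-proportionate parent selection, 1-point crossover and bitwise mutation; decode and fitness are hypothetical user-supplied functions, not part of the slides.

    import random

    def generational_ga(pop_size, chrom_len, fitness, decode,
                        generations=100, p_cross=0.7, p_mut=0.01):
        # Randomly generate a population of binary chromosomes
        pop = [[random.randint(0, 1) for _ in range(chrom_len)]
               for _ in range(pop_size)]
        for _ in range(generations):
            # Decode chromosomes and evaluate the fitness of all individuals
            scores = [fitness(decode(c)) for c in pop]
            new_pop = []
            while len(new_pop) < pop_size:
                # Fitness-proportionate selection of two parents
                p1, p2 = random.choices(pop, weights=scores, k=2)
                if random.random() < p_cross:
                    point = random.randint(1, chrom_len - 1)   # 1-point crossover
                    c1, c2 = p1[:point] + p2[point:], p2[:point] + p1[point:]
                else:
                    c1, c2 = p1[:], p2[:]                      # cloning
                for child in (c1, c2):
                    # Mutation: flip each bit with a small probability
                    new_pop.append([b ^ 1 if random.random() < p_mut else b
                                    for b in child])
            pop = new_pop[:pop_size]
        # Return the best chromosome from the final generation
        return max(pop, key=lambda c: fitness(decode(c)))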
Steady-state GA

- Randomly generate a population of chromosomes
- While (termination condition not met):
  - Decode chromosomes into individuals
  - Evaluate fitness of all individuals
  - Select fittest individuals
  - Produce offspring by crossover and mutation
  - Replace weakest individuals with offspring (see the sketch below)
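A sketch of a single steady-state update under the same assumptions as above (binary chromosomes, hypothetical fitness function); only the two weakest individuals are replaced, so most of the population carries over between generations.

    import random

    def steady_state_step(pop, fitness, p_mut=0.01):
        # Evaluate fitness of all individuals
        scores = [fitness(c) for c in pop]
        # Select two parents, fitness-proportionately
        p1, p2 = random.choices(pop, weights=scores, k=2)
        # Produce two offspring by 1-point crossover ...
        point = random.randint(1, len(p1) - 1)
        children = [p1[:point] + p2[point:], p2[:point] + p1[point:]]
        # ... and bitwise mutation
        children = [[b ^ 1 if random.random() < p_mut else b for b in c]
                    for c in children]
        # Replace the two weakest individuals with the offspring
        worst = sorted(range(len(pop)), key=lambda i: scores[i])[:2]
        for idx, child in zip(worst, children):
            pop[idx] = child
        return pop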
Genetic Encoding / Decoding

- We focus on binary encodings of solutions
- We first look at single parameters (i.e., single-gene chromosomes) and then vectors of parameters (i.e., multi-gene chromosomes)
Integer Parameters

- Let p be the parameter to be encoded. There are three distinct cases to consider:
  - p takes values from {0, 1, ..., 2^N - 1} for some N
    - Then p can be encoded directly by its equivalent binary representation
  - p takes values from {M, M+1, ..., M + 2^N - 1} for some M and N
    - Then (p - M) can be encoded directly by its equivalent binary representation
  - p takes values from {0, 1, ..., L-1} for some L such that there exists no N for which L = 2^N
    - Then there are two possibilities: clipping or scaling
Clipping

- Clipping consists of taking N = ⌊log₂(L)⌋ + 1 bits and encoding all parameter values 0 ≤ p ≤ L-2 by their equivalent binary representation, letting all other N-bit strings serve as encodings of p = L-1 (see the sketch below).
- For example, assume p takes values in {0, 1, 2, 3, 4, 5}, i.e., L = 6. Then N = ⌊log₂(6)⌋ + 1 = 3.

      p    encoding
      0    000
      1    001
      2    010
      3    011
      4    100
      5    101
      5    110
      5    111

  - Here, not only is 101 an (expected) encoding of p = L-1 = 5, but so are 110 and 111
- Advantages: easy to implement.
- Disadvantages: strong representational bias, i.e., all parameter values between 0 and L-2 have a single encoding, whilst the single parameter value L-1 has 2^N - L + 1 encodings.
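A small sketch of clipping for illustration (the function names are mine, not from the slides):

    import math

    def clip_encode(p, L):
        # N = floor(log2(L)) + 1 bits; values 0..L-1 keep their plain binary form
        N = math.floor(math.log2(L)) + 1
        return format(p, f'0{N}b')

    def clip_decode(bits, L):
        # Every N-bit string that exceeds L-1 is read back as L-1
        return min(int(bits, 2), L - 1)

    # Example with L = 6: the codes 101, 110 and 111 all decode to 5
    print(clip_decode('101', 6), clip_decode('110', 6), clip_decode('111', 6))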
Scaling

- Scaling consists of taking N = ⌊log₂(L)⌋ + 1 bits and encoding p by the binary representation of an integer value e such that p = ⌊e(L-1)/(2^N - 1)⌋, as in the table below (see also the sketch that follows).
- For example, assume p takes values in {0, 1, 2, 3, 4, 5}, i.e., L = 6. Then N = ⌊log₂(6)⌋ + 1 = 3.

      p    encoding
      0    000
      0    001
      1    010
      2    011
      2    100
      3    101
      4    110
      5    111

  - Here, the binary encodings are not generally numerically equivalent to the integer values they code
- Advantages: easy to implement and smaller representational bias than clipping (each value of p has 1 or 2 encodings, with double encodings evenly spread over the values of p)
- Disadvantages: more computation needed and still a small representational bias
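A sketch of the decoding direction, matching the table above (the function name is mine):

    def scale_decode(bits, L):
        # Read the N-bit string as an integer e and map it onto {0, ..., L-1}
        # via p = floor(e * (L-1) / (2^N - 1))
        N = len(bits)
        e = int(bits, 2)
        return (e * (L - 1)) // (2 ** N - 1)

    print([scale_decode(format(e, '03b'), 6) for e in range(8)])
    # [0, 0, 1, 2, 2, 3, 4, 5]  -- the p column of the table above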
Real-valued Parameters (I)

- Real values may be encoded as fixed-point numbers or integers via scaling and quantization
  - If p ranges over [min, max], then p is encoded by the binary representation of the integer part of:

        (2^N - 1) · p / (max - min)
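A sketch of this scale-and-quantize step, assuming the common variant that first shifts p by min (which coincides with the formula above when min = 0); the function names are mine:

    def encode_real(p, lo, hi, N):
        # Integer part of (2^N - 1) * (p - lo) / (hi - lo), written on N bits
        e = int((2 ** N - 1) * (p - lo) / (hi - lo))
        return format(e, f'0{N}b')

    def decode_real(bits, lo, hi):
        N = len(bits)
        return lo + int(bits, 2) * (hi - lo) / (2 ** N - 1)

    print(encode_real(0.5, 0.0, 1.0, 8))      # '01111111'
    print(decode_real('01111111', 0.0, 1.0))  # ~0.498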
Real-valued Parameters (II)

- Real values may also be encoded using thermometer encoding (see the sketch below)
  - Let T be an integer greater than 1
  - Thermometer encoding of real values on T bits consists of normalizing all real values to the interval [0,1] and converting each normalized value x to a bit-string of xT (rounded down) 1s followed by trailing 0s as needed.
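A minimal sketch of thermometer encoding (the function name is mine):

    def thermometer_encode(p, lo, hi, T):
        # Normalize p to [0,1], then emit floor(x*T) ones followed by trailing zeros
        x = (p - lo) / (hi - lo)
        ones = int(x * T)
        return '1' * ones + '0' * (T - ones)

    print(thermometer_encode(3.0, 0.0, 10.0, 5))   # '10000' (x = 0.3, floor(1.5) = 1)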
Vectors of Parameters

- Vectors of parameters are encoded on multi-gene chromosomes by combining the encodings of each individual parameter
- Let e_i = [b_i0, ..., b_iN] be the encoding of the ith of M parameters
- There are two possibilities for combining the e_i's onto a chromosome (see the sketch below):
  - Concatenating: Here, individual encodings simply follow each other in some pre-defined order, e.g., [b_10, ..., b_1N, ..., b_M0, ..., b_MN]
  - Interleaving: Here, the bits of each individual encoding are interleaved, e.g., [b_10, ..., b_M0, ..., b_1N, ..., b_MN]
- The order of parameters in the vector (resp., genes on the chromosome) is important, especially for concatenated encodings
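A short sketch of the two combination schemes (function names are mine; all parameter encodings are assumed to have the same length):

    def concatenate(encodings):
        # Parameter encodings simply follow each other
        return [bit for enc in encodings for bit in enc]

    def interleave(encodings):
        # First bit of every parameter, then second bit of every parameter, and so on
        return [bit for position in zip(*encodings) for bit in position]

    e1, e2 = [1, 0, 1], [0, 1, 1]
    print(concatenate([e1, e2]))   # [1, 0, 1, 0, 1, 1]
    print(interleave([e1, e2]))    # [1, 0, 0, 1, 1, 1]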
Gray Coding (I)

- A Gray code represents each number in the sequence of integers 0, 1, ..., 2^N - 1 as a binary string of length N, such that adjacent integers have representations that differ in only one bit position
- A number of different Gray codes exist. One simple algorithm to produce Gray codes starts with all bits set to 0 and successively flips the right-most bit that produces a new string (see the sketch below)
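A sketch using the standard binary-reflected construction g = b XOR (b >> 1), which yields the same sequence as the flip-the-right-most-bit procedure above (and the table on the next slide):

    def gray_code(b, N):
        # Adjacent integers map to strings that differ in exactly one bit
        return format(b ^ (b >> 1), f'0{N}b')

    print([gray_code(b, 3) for b in range(8)])
    # ['000', '001', '011', '010', '110', '111', '101', '100']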
Gray Coding (II)

      Integer value    Binary coding    Gray coding
      0                000              000
      1                001              001
      2                010              011
      3                011              010
      4                100              110
      5                101              111
      6                110              101
      7                111              100
Gray Coding (III)

- Advantages: random bit-flips (e.g., during mutation) are more likely to produce small changes (i.e., there are no Hamming cliffs, since adjacent integers' representations differ by exactly one bit).
- Disadvantages: big changes are rare but bigger than with binary codes.
- For example, consider the string 001. There are 3 possible bit flips, leading to the strings 000, 011 and 101:
  - With standard binary encoding, 2 of the 3 flips lead to relatively large changes (from 001 (=1) to 011 (=3) and from 001 (=1) to 101 (=5), respectively)
  - With Gray coding, 2 of the 3 flips produce small changes (from 001 (=1) to 000 (=0) and from 001 (=1) to 011 (=2), respectively)
  - However, the less probable (1 out of 3) flip from 001 to 101 produces a bigger change under Gray coding (to 6) than under standard binary encoding (to 5)
GA operators

- We will restrict our discussion to binary strings
- The basic GA operators are:
  - Selection
  - Crossover
  - Mutation
Selection

- Selection is the operation by which chromosomes are selected for reproduction
- Chromosomes corresponding to individuals with a higher fitness have a higher probability of being selected
- There are a number of possible selection schemes (we discuss some here)
- Fitness-based selection makes the following assumptions:
  - There exists a known quality measure Q for the solutions of the problem
  - Finding a solution can be achieved by maximizing Q
  - For all potential solutions (good or bad), Q is positive
  - A chromosome's fitness is taken to be the quality measure of the individual it encodes
Fitness-proportionate Selection

- This selection scheme is the most widely used in GAs
- Let f_i be the fitness value of individual i and let f_avg be the average population fitness
- Then, the probability of an individual i being selected is given by:

      p_i = f_i / (f_1 + f_2 + ... + f_N) = (1/N) · (f_i / f_avg)
Roulette Wheel

- Fitness-proportionate selection (FPS) can be implemented with the roulette-wheel algorithm
- A wheel is constructed with markers corresponding to fitness values
- For each fitness value f_i, the size of the marker (i.e., the proportion of the wheel's circumference) associated to f_i is given by p_i as defined above
- Hence, when the wheel is spun, the probability of the roulette landing on f_i (and thus selecting individual i) is given by p_i, as expected
Vector Representation

- A vector v of M elements from {1, ..., N} is constructed so that each individual i in {1, ..., N} has M·p_i entries in v
- A random index r from {1, ..., M} is drawn and individual v(r) is selected
- Example (see the sketch below):
  - 4 individuals such that f_1 = f_2 = 10, f_3 = 15 and f_4 = 25
  - If M = 12, then v = (1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4)
  - Generate r = 6, then individual v(6) = 3 is selected
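A sketch of the vector representation (function names are mine; slots per individual are rounded, so accuracy depends on M, as discussed two slides below):

    import random

    def build_selection_vector(fitnesses, M):
        # Individual i gets round(M * p_i) entries in v
        total = sum(fitnesses)
        v = []
        for i, f in enumerate(fitnesses, start=1):
            v.extend([i] * round(M * f / total))
        return v

    v = build_selection_vector([10, 10, 15, 25], M=12)
    print(v)                  # [1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4]
    print(random.choice(v))   # draw a random index r and return v(r), e.g. 3 for r = 6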
Cumulative Distribution

- A random real-valued number r in [0, f_1 + f_2 + ... + f_N] is chosen, and the individual i such that

      f_1 + ... + f_(i-1) ≤ r < f_1 + ... + f_i

  is selected (if i = 1, the lower-bound sum is 0). A sketch follows below.
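A minimal sketch of this cumulative-distribution lookup (this is also the usual way the roulette wheel is implemented in practice):

    import random

    def roulette_select(population, fitnesses):
        # Spin once: r falls in [0, sum of fitnesses]
        r = random.uniform(0, sum(fitnesses))
        running = 0.0
        for individual, f in zip(population, fitnesses):
            running += f
            # Select the first individual whose cumulative sum reaches r
            if r <= running:
                return individual
        return population[-1]   # guard against floating-point round-off

    print(roulette_select(['A', 'B', 'C', 'D'], [10, 10, 15, 25]))   # 'D' with probability 25/60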
Discussion (I)

- The implementation based on the cumulative distribution is effective but relatively inefficient, whilst the implementation based on the vector representation is efficient, but its effectiveness depends on M (i.e., the value of M determines the level of quantization of the p_i's, and thus accuracy depends on M).
- Assume that N individuals have to be selected for reproduction. The expected number of copies of each individual in the mating pool is:

      N_i = N·p_i = f_i / f_avg

- Hence, individuals with above-average fitness tend to have more than one copy in the mating pool, whilst individuals with below-average fitness tend not to be copied. This leads to problems with FPS.
Discussion (II)

- Premature convergence
  - Assume an individual X with f_i >> f_avg but f_i << f_max is produced in an early generation. As N_i >> 1, the genes of X quickly spread all over the population. At that point, crossover cannot generate any new solutions (only mutation can) and f_avg << f_max forever.
- Stagnation
  - Assume that at the end of a run (i.e., in one of the later generations) all individuals have a relatively high and similar fitness, i.e., f_i is almost f_max for all i. Then, N_i is almost 1 for all i and there is virtually no selective pressure.
- Both of these problems can be solved with fitness scaling techniques
Fitness Scaling

- Essentially, fitness values are scaled down at the beginning and scaled up towards the end
- There are 3 general scaling methods:
  - Linear scaling (see the sketch below)
    - f is replaced by f' = a·f + b, where a and b are chosen such that:
      - f'_avg = f_avg (i.e., the scaled average is the same as the raw average)
      - f'_max = c·f_avg (c is the number of expected copies desired for the best individual; usually c = 2)
    - The scaled fitness function may take on negative values if there are a few bad individuals with fitness much lower than f_avg and f_avg is close to f_max. One solution is to arbitrarily assign the value 0 to all negative fitness values.
  - Sigma truncation
    - f is replaced by f' = f - (f_avg - c·σ), where σ is the population standard deviation, c is a reasonable multiple of σ (usually 1 ≤ c ≤ 3) and negative results are arbitrarily set to 0. Truncation removes the problem of scaling to negative values. (Note that truncated fitness values may also be scaled if desired.)
  - Power law scaling
    - f is replaced by f' = f^k for some suitable k. This method is not used very often. In general, k is problem-dependent and may require dynamic change to stretch or shrink the range as needed.
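A sketch of linear scaling with the two constraints above solved for a and b (the function name is mine); negative scaled values are clamped to 0, as suggested:

    def linear_scale(fitnesses, c=2.0):
        f_avg = sum(fitnesses) / len(fitnesses)
        f_max = max(fitnesses)
        if f_max == f_avg:          # all fitnesses equal: nothing to stretch
            return list(fitnesses)
        # f'_avg = f_avg and f'_max = c * f_avg give:
        a = (c - 1.0) * f_avg / (f_max - f_avg)
        b = f_avg * (1.0 - a)
        return [max(0.0, a * f + b) for f in fitnesses]

    print(linear_scale([1, 2, 3, 10]))   # [2.0, ~2.67, ~3.33, 8.0]: best is scaled to 2 * average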
Rank Selection

- All individuals are sorted by increasing values of their fitness (see the sketch below)
- Then, each individual is assigned a probability p_i of being selected from some prior probability distribution
- Typical distributions include:
  - Linear: Here, p_i = a·i + b
  - Negative exponential: Here, p_i = a·e^(b·i) + c
- Rank selection (RS) has little biological plausibility. However, it has the following desirable features:
  - No premature convergence. Because of the ranking and the probability distribution imposed on it, even less fit individuals will be selected (e.g., let there be 3 individuals such that f_1 = 90, f_2 = 7, f_3 = 3, and p_i = -0.4i + 1.3. With FPS, p_1 = 0.9 >> p_2 = 0.07 and p_3 = 0.03, so that individual 1 comes to saturate the population. With RS, p_1 = 0.9, p_2 = 0.5 and p_3 = 0.1, so that individual 2 is also selected.)
  - No stagnation. Even at the end, N_1 ≈ N_2 ≈ ... (similar argument to above).
  - Explicit fitness values not needed. To order individuals, only the ability to compare pairs of solutions is necessary.
- However: rank selection introduces a reordering overhead and makes a theoretical analysis of convergence difficult.
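A sketch of rank selection in which the selection weight is simply proportional to the rank (one possible linear distribution, not necessarily the one used on the slide; the function name is mine):

    import random

    def rank_select(population, fitnesses):
        # Sort individuals by increasing fitness; rank 1 = worst, rank N = best
        order = sorted(range(len(population)), key=lambda i: fitnesses[i])
        weights = [0] * len(population)
        for rank, idx in enumerate(order, start=1):
            weights[idx] = rank
        # Draw with probabilities proportional to rank, not to raw fitness
        return random.choices(population, weights=weights, k=1)[0]

    print(rank_select(['A', 'B', 'C'], [90, 7, 3]))   # 'A' is most likely, but 'C' still has a chance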
Tournament Selection

- Tournament selection can be viewed as a noisy version of rank selection.
- The selection process is two-stage (see the sketch below):
  - Select a group of N (≥ 2) individuals
  - Select the individual with the highest fitness from the group and discard all others
- Tournament selection inherits the advantages of rank selection. In addition, it does not require global reordering and is more naturally inspired.
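A minimal sketch of tournament selection (the function name is mine; the group size is called k here to avoid clashing with the population size N):

    import random

    def tournament_select(population, fitnesses, k=2):
        # Stage 1: draw a group of k individuals at random
        group = random.sample(range(len(population)), k)
        # Stage 2: keep the fittest member of the group, discard the others
        winner = max(group, key=lambda i: fitnesses[i])
        return population[winner]

    print(tournament_select(['A', 'B', 'C', 'D'], [10, 10, 15, 25], k=3))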
Elitist Selection

- The idea behind elitism is that at least one copy of the best individual in the population is always passed on to the next generation.
- The main advantage is that convergence is guaranteed (i.e., if the global maximum is discovered, the GA converges to that maximum). By the same token, however, there is a risk of being trapped in a local maximum.
- One alternative is to save the best individual so far in some kind of register and, at the end of each run, to designate it as the solution instead of using the best of the last generation.
1-point Crossover

- Here, the chromosomes of the parents are cut at some randomly chosen common point and the resulting sub-chromosomes are swapped (see the sketch below)
- For example:
  - P1=1010101010 and P2=1110001110
  - Crossover point between the 6th and 7th bits
  - Then the offspring are:
    - O1=1010101110
    - O2=1110001010
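A small sketch of 1-point crossover (the function name is mine):

    import random

    def one_point_crossover(p1, p2):
        # Cut both parents at a common random point and swap the tails
        point = random.randint(1, len(p1) - 1)
        return p1[:point] + p2[point:], p2[:point] + p1[point:]

    # With the cut between the 6th and 7th bits (point = 6):
    p1, p2 = '1010101010', '1110001110'
    print(p1[:6] + p2[6:], p2[:6] + p1[6:])   # 1010101110 1110001010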
2-point Crossover

- Here, the chromosomes are thought of as rings, with the first and last gene connected (i.e., a wrap-around structure)
- The rings are cut at two sites and the resulting sub-rings are swapped (see the sketch below)
- For example:
  - P1=1010101010 and P2=1110001110
  - Crossover points are between the 2nd and 3rd bits, and between the 6th and 7th bits
  - Then the offspring are:
    - O1=1110101110
    - O2=1010001010
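A sketch of 2-point crossover that exchanges the segment between the two cuts (the function name is mine; it produces the same pair of offspring as the slide, just labelled in the other order):

    import random

    def two_point_crossover(p1, p2):
        # Choose two common cut points and swap the middle segment
        a, b = sorted(random.sample(range(1, len(p1)), 2))
        return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]

    # Cuts after the 2nd and 6th bits (a = 2, b = 6):
    p1, p2 = '1010101010', '1110001110'
    print(p1[:2] + p2[2:6] + p1[6:], p2[:2] + p1[2:6] + p2[6:])
    # 1010001010 1110101110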
Uniform Crossover

- Here, each gene of the offspring is selected randomly from the corresponding genes of the parents (see the sketch below)
- For example:
  - P1=1010101010 and P2=1110001110
  - Then the offspring could be:
    - O=1110101110
- Note: produces a single offspring
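A one-line sketch of uniform crossover (the function name is mine):

    import random

    def uniform_crossover(p1, p2):
        # Each gene of the single offspring comes from one parent or the other, at random
        return ''.join(random.choice(pair) for pair in zip(p1, p2))

    print(uniform_crossover('1010101010', '1110001110'))   # e.g. 1110101110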
Mutation

- Mutation consists of making (usually small) alterations to the values of one or more genes in a chromosome (see the sketch below)
- In binary chromosomes, it consists of flipping random bits of the genotype. For example, 1010101010 may become 1011101010 if the 4th bit is flipped.
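A sketch of bitwise mutation (the function name is mine; each bit is flipped independently with a small probability):

    import random

    def mutate(chromosome, p_mut=0.1):
        # Flip each bit with probability p_mut
        return ''.join(b if random.random() > p_mut else str(1 - int(b))
                       for b in chromosome)

    print(mutate('1010101010'))   # e.g. 1011101010 if only the 4th bit happens to flip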