Homologous and Nonhomologous Rearrangements: Interactions and Effects on Evolvability

David P. Parsons (1,3), Carole Knibbe (2,3) and Guillaume Beslon (1,3)

(1) Université de Lyon, CNRS, INRIA, INSA-Lyon, LIRIS, UMR5205, F-69621, France
(2) Université de Lyon, CNRS, INRIA, Université Lyon 1, LIRIS, UMR5205, F-69622, France
(3) IXXI, Institut Rhône-Alpin des Systèmes Complexes, Lyon, F-69007, France

Abstract

By using Aevol, a simulation framework designed to study the evolution of genome structure, we investigate the effect of homologous rearrangements on the course of evolution. We designed an efficient model of rearrangements based on an intermittent search algorithm. Then, using experimental in silico evolution, we explore the effect of rearrangement rates on the genome structure. We show that the effect of homologous rearrangements is quite complex. At first glance they appear to be dangerous enough to trigger an indirect selective pressure leading to short genomes when the rearrangement rate is high. However, by analyzing the successful lineage in the best runs, we found that there is a positive correlation between the number of homologous rearrangements and the fitness improvement in these lineages. Thus the impact of homologous rearrangements on evolution is rather complex: dangerous on the one hand but necessary on the other hand, to ensure a sufficient level of evolvability for the organisms. Moreover, our results show that the spontaneous rate of small mutations influences the relative proportions of homologous versus nonhomologous rearrangements.

Introduction

Chromosomal rearrangements are known to play a major role in evolution. Their most visible effects are quite straightforward: duplications and deletions account for numerous gene acquisitions or losses while translocations and inversions have a direct influence on gene order. However, these direct effects are flanked by other indirect selective pressures.
The rates and mechanisms of rearrangements indeed influence the evolvability (Kirschner and Gerhart, 1998) of the lineage and, as stated by Earl and Deem (2004), evolvability itself can be subject to evolution. In the long term, more evolvable lineages are more likely to produce beneficial mutations and hence to overcome lineages with lower evolvability. Similarly, Wilke et al. (2001) showed a second-order selective pressure on mutational robustness. The selection of a specific level of evolvability or robustness is said to be indirect because these properties do not influence the fitness of the organism, but that of its descendants.

Unraveling these second-order pressures is a very challenging matter. Indeed, the underlying processes are complex and act on a very long time scale. It is hence difficult to tackle such questions either in vivo or in vitro. Comparative genomics approaches are a way to circumvent this difficulty. However, they are based upon static snapshots of the contemporary sequences and have to infer their evolutionary past. Artificial life and in silico simulations are very useful in such cases, providing us with insights into complex mechanisms and shedding light onto second-order pressures that would have been difficult to identify otherwise (Wilke et al., 2001; Adami, 2006; Misevic et al., 2006; Knibbe et al., 2007; Beslon et al., 2010). They offer a dynamic view of the evolutionary process and provide the experimentalist with very good control over parameters as well as a perfect fossil record throughout the evolution.

The Aevol model was developed specifically to study the evolution of genome structure. Experiments using this model underlined the major importance of chromosomal rearrangements in the evolutionary process. For a start, we observed that in the total absence of chromosomal rearrangements, evolution can hardly occur at all because gene duplications are necessary to acquire new genes and thus new functions.
Secondly, it has been shown that, because of rearrangements, non-coding sequences can become mutagenic for the surrounding genes. The consequence is a clear trend for organisms that evolved under high rearrangement rates to have shorter and denser genomes than those that evolved under lower rates of rearrangement (Knibbe et al., 2007; Parsons et al., 2010). As we have already shown, this effect is the consequence of the long-term selection of a specific level of mutational variability (Knibbe et al., 2007).

Unlike point mutations and indels that produce local variations, chromosomal rearrangements can involve huge sequences and turn a very fit individual into an ill-adapted one in a single event. Chromosomal rearrangements can hence be very dangerous. However, rearrangements are usually not fully random. Most rearrangements are the consequence of error-repair mechanisms such as the RecA-mediated double-strand break repair mechanism (Neidhardt, 1996). These mechanisms usually require that the sequences to be rearranged be similar, at least around the breakpoints. Such rearrangements based on sequence similarity are called homologous rearrangements. By contrast, we call here nonhomologous rearrangements those that occur between sequences of low similarity. It is tempting to think that, because they are partially directed, homologous rearrangements could be less dangerous than rearrangements occurring at random points.

To investigate the role of homologous rearrangements in genome evolution, we modified the Aevol model to introduce a sensitivity to sequence similarity in the rearrangement process: a rearrangement is now more likely to occur between similar sequences (homologous recombination) but remains possible, although at a low probability, when the breakpoints differ (nonhomologous recombination).
After an overall presentation of the Aevol model, focusing particularly on the way we take sequence homologies into account in the rearrangement process, we will present our results regarding the different effects of homologous and nonhomologous rearrangements. We will discuss the intricate relationship that exists between homologous and nonhomologous rearrangements, and their impact on evolvability.

Aevol: A digital genetics model

The Aevol model was developed in our team to study the evolution of genome structure. It simulates the evolution of a population of N artificial haploid organisms with flexible genomes. Although a description of the model has already been published (see Knibbe et al. (2008) and its supplementary material), we provide hereafter an overview of the principles that are necessary to understand the results presented here.

In Aevol, each artificial organism owns a genome whose structure is inspired by prokaryotic genomes. It is organized as a circular double-strand binary string containing a variable number of genes separated by non-coding sequences (figure 1). Genes are identified and decoded thanks to an explicit transcription-translation process based upon predefined signaling sequences. Then, an abstract “folding” process gives rise to artificial “proteins” that are able to realize or deflect a particular range of abstract “biological functions”. The interaction of all these proteins yields the set of functions the organism is able to perform, which will in turn be compared to an environmental target to determine how well-adapted this individual is. At each generation, N new individuals are created by reproducing preferentially the best individuals of the parental generation, which is then completely replaced.
During the replication process, the chromosome can undergo different kinds of modifications: local mutations (point mutations, small insertions and small deletions), but also large chromosomal rearrangements (duplications, deletions, translocations and inversions). At the beginning of the run, all the organisms are initialized with the same random sequence (of 5,000 base-pairs here) which contains at least one gene.

[Figure 1 labels: double-stranded genome with scattered genes; Promoter, Shine-Dalgarno, START, Coding DNA Sequence, STOP, Terminator.]

Figure 1: In Aevol, each individual owns a circular double-stranded binary genome upon which coding sequences are identified thanks to predefined signalling sequences: promoters and terminators mark the boundaries of transcribed sequences and, inside these transcribed regions, coding sequences can exist between a START signal and an in-frame STOP codon (see figure 2 for the genetic code).

From genotype to phenotype

Transcription

In prokaryotes, transcription initiates at particular sites, called promoters, where the RNA polymerases recognize a consensus sequence to which they can bind and begin the RNA synthesis. In Aevol, we defined a long consensus sequence, a promoter being a sequence whose Hamming distance d with this consensus is less than or equal to $d_{\max}$. In the experiments presented here, the consensus was a 22-base-pair (bp) sequence and up to $d_{\max} = 4$ mismatches were allowed. This consensus sequence is long enough to ensure that random, non-coding sequences have a low probability of becoming coding by a single mutation event. When a promoter is found, the transcription goes on until a terminator is reached. We defined terminators as sequences that would be able to form a stem-loop structure, as the ρ-independent bacterial terminators do. In these experiments, the stem size was set to 4 and the loop size to 3; terminators thus had the following structure: abcd∗∗∗dcba, where a, b, c, d = 0 or 1.
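The promoter and terminator rules described above can be illustrated with a minimal Python sketch. It is not the Aevol implementation. The consensus string `CONSENSUS` is hypothetical (the text gives only its length, 22 bp), all function names are illustrative, and the terminator test assumes that the two arms of the stem pair complementary bases (0 with 1), as a stem-loop requires; the structure as printed above shows no complement marks, so this is an assumption.

```python
# Hypothetical 22-bp consensus: the text gives its length, not its sequence.
CONSENSUS = "0101011001110010010110"
D_MAX = 4          # maximum number of mismatches allowed (from the text)
STEM, LOOP = 4, 3  # stem and loop sizes used in the experiments


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def circular_slice(genome: str, start: int, length: int) -> str:
    """Read `length` bases from `start`, wrapping around the circular genome."""
    n = len(genome)
    return "".join(genome[(start + k) % n] for k in range(length))


def promoter_distance(genome: str, pos: int):
    """Return the Hamming distance d if a promoter starts at `pos`, else None."""
    d = hamming(circular_slice(genome, pos, len(CONSENSUS)), CONSENSUS)
    return d if d <= D_MAX else None


def is_terminator(genome: str, pos: int) -> bool:
    """True if the bases at `pos` can fold into a stem-loop.

    Assumption: the two arms of the stem must be complementary
    (0 pairs with 1) when read in opposite directions.
    """
    site = circular_slice(genome, pos, 2 * STEM + LOOP)
    return all(site[i] != site[-1 - i] for i in range(STEM))
```

For instance, `promoter_distance(CONSENSUS + "0000", 0)` returns 0, and a site whose first four bases are mirrored unchanged (a plain palindrome) is rejected by `is_terminator` under the complementarity assumption.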
The expression level e of an RNA is determined according to its promoter sequence: $e = 1 - \frac{d}{d_{\max} + 1}$. This modulation of the expression level models in a simplified way the basal interaction of the RNA polymerase with the promoter, without additional regulation. It provides duplicated genes with a way to temporarily reduce their phenotypic contribution while diverging toward other functions.

Algorithm 1: Aevol General Algorithm

    foreach Generation do
        // Evaluation
        foreach Individual do
            Identify coding sequences
            foreach CodingSequence do
                Translate into abstract protein
            end
            Compute phenotype by combining protein contributions
            Compute fitness by comparing the phenotype to the environmental target
        end
        // Selection
        Sort the individuals by fitness
        Compute the probabilities of reproduction
        Draw the actual numbers of offspring
        // Reproduction
        foreach Individual do
            foreach Offspring do
                Do Rearrangements
                Do Local Mutations
            end
        end
        Replace current population
    end

Translation

Transcribed sequences (RNAs) do not necessarily result in a protein. The translation process of an RNA takes place when a Shine-Dalgarno-like sequence is found, followed, a few base-pairs away, by a START codon (see genetic code on figure 2). Whenever this signal 011011****000 is found, the following sequence is read three bases (one codon) at a time until the STOP codon (001) is found on the same reading frame. Each codon lying between the initiation and termination signals is translated into an abstract “Amino-Acid” using an artificial genetic code, therefore giving rise to the protein’s primary sequence (figure 2). As in real organisms, genes can be found on six different reading frames (three on each strand), giving the possibility for the organisms to evolve overlapping genes, which are commonly found in viruses and bacteria.
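As an illustration of the expression-level formula and of the translation rule, here is a minimal Python sketch, not the Aevol code. The codon table is read off figure 2, and the test sequence used in the usage note is the example RNA printed in that figure. The function names are illustrative, the RNA is treated as a linear string on a single strand, and skipping an in-frame 000 codon is an assumption (the text does not say how it is handled).

```python
# Codon table as read from figure 2 (000 = START, 001 = STOP).
GENETIC_CODE = {
    "100": "M0", "101": "M1",
    "010": "W0", "011": "W1",
    "110": "H0", "111": "H1",
}
SHINE_DALGARNO, SPACER, START, STOP = "011011", 4, "000", "001"


def expression_level(d: int, d_max: int = 4) -> float:
    """e = 1 - d / (d_max + 1), with d the promoter's Hamming distance."""
    return 1.0 - d / (d_max + 1)


def translate(rna: str):
    """Return one list of abstract amino acids per gene found on `rna`."""
    proteins = []
    signal_len = len(SHINE_DALGARNO) + SPACER + len(START)
    for i in range(len(rna) - signal_len + 1):
        start_at = i + len(SHINE_DALGARNO) + SPACER
        if rna[i:i + len(SHINE_DALGARNO)] != SHINE_DALGARNO:
            continue
        if rna[start_at:start_at + 3] != START:
            continue
        amino_acids = []
        # Read codon by codon, in frame, until a STOP codon is met.
        for j in range(start_at + 3, len(rna) - 2, 3):
            codon = rna[j:j + 3]
            if codon == STOP:
                proteins.append(amino_acids)
                break
            # Assumption: an in-frame 000 has no table entry and is skipped.
            if codon in GENETIC_CODE:
                amino_acids.append(GENETIC_CODE[codon])
    return proteins


def split_parameters(amino_acids):
    """De-interlace the primary sequence into the M, W and H bit strings."""
    bits = {"M": "", "W": "", "H": ""}
    for aa in amino_acids:
        bits[aa[0]] += aa[1]
    return bits
```

Applied to the figure 2 example, `translate("0110110011000101111011101110011010001")` yields the single protein M1-H1-W1-M1-H0-W1-W0, and `split_parameters` then gives the bit strings 11, 110 and 10 for M, W and H, matching the binary codes shown in the figure. How these bit strings are converted into numerical values is not covered by this sketch.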
Protein “folding” and phenotype computation

To model the activity of proteins and the resulting phenotype, we defined a simple “artificial chemistry” (Dittrich et al., 2001) that describes the organism’s metabolism in a mathematical language. In our simplified artificial world, we assume that there is an abstract, one-dimensional space Ω = [0, 1] representing all the possible metabolic processes (that is, in this model, a metabolic process is just a real number). In this “metabolic space”, each protein is involved in a subset of processes (either realizing it or preventing other proteins from realizing it) which is described using the fuzzy set formalism: a given protein can be involved in a metabolic process with a possibility degree lying between 0 and 1. A protein is thus fully characterized by a mathematical function that associates a possibility degree to each metabolic process, describing the fuzzy subset of metabolic processes it is involved in. For simplicity, we use piecewise-linear functions with a symmetric, triangular shape (figure 2). In this way, only three numbers are needed to characterize the metabolic activity of a protein: the position m (m ∈ Ω) of the triangle on the axis, its half-width w and its height h (positive when realizing a function, negative when inhibiting it). This means that the protein contributes to the range [m−w, m+w] of metabolic processes, with a preference for the processes closest to m (for which the highest efficiency, h, is reached). Thus, various types of proteins can co-exist, from highly efficient and specialized ones (small w, high h) to polyvalent but poorly efficient ones (large w, low h).

[Figure 2 content: a promoter, 5’ UTR, Shine-Dalgarno sequence, START, coding sequence and STOP on the double strand, with expression level e. Genetic code: 000 = START, 001 = STOP, 100 = M0, 101 = M1, 010 = W0, 011 = W1, 110 = H0, 111 = H1. Example: the sequence …0110110011000101111011101110011010001… is translated into M1-H1-W1-M1-H0-W1-W0, giving the binary codes M: 11, W: 110, H: 10 (normalized values shown: 0.66, 0.07, 0.33). The resulting triangular function has height e.|h|.]

Figure 2: Overview of the transcription-translation-folding process in Aevol. Transcribed sequences are those that start with a promoter (consensus sequence) and end with a terminator sequence (stem-loop structure), not shown on the figure. Coding sequences (genes) are searched within the transcribed sequences; they begin with a Shine-Dalgarno/START sequence and end with a STOP codon. An artificial genetic code (right) is used to convert a gene into the primary sequence of the corresponding protein and a “folding process” enables us to compute the metabolic activity of this protein (functional abilities).

In this framework, each protein’s primary sequence is decomposed into three interlaced binary subsequences that will in turn be interpreted as the values for the m, w and h parameters. For instance, the codon 010 (resp. 011) is translated into the single amino acid W0 (resp. W1), which means that it contributes to the value of w by adding a bit 0 (resp. 1) to its binary code. Small mutations in the coding sequence (point mutations, indels, possibly causing frame shifts) will change these parameters, resulting in a modification of the protein’s metabolic activity.

Once all the proteins encoded on the genotype of the organism have been identified and characterized, their activities are combined into a fuzzy set representing the individual’s phenotype $P = (\cup_i A_i) \cap \overline{(\cup_j I_j)}$, using Łukasiewicz’s fuzzy operators, with $A_i$ being the fuzzy subset of the i-th activating protein ($h_i > 0$) and $I_j$ the fuzzy subset of the j-th inhibiting protein ($h_j < 0$). Intuitively, this means that metabolic processes achieved by the organism are those that are activated and not inhibited.
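The combination of triangular protein functions into a phenotype can be sketched as follows. This is a minimal illustration under stated assumptions, not the Aevol implementation: the function names are hypothetical, the height h is taken as already scaled by the expression level e, the Łukasiewicz union is computed as the bounded sum min(1, a + b), and the intersection with the complement of the inhibition reduces to max(0, activation − inhibition).

```python
def triangle(x: float, m: float, w: float, h: float) -> float:
    """Unsigned possibility degree at process x of a protein (m, w, h)."""
    if w <= 0:
        return 0.0
    return abs(h) * max(0.0, 1.0 - abs(x - m) / w)


def phenotype_at(x: float, proteins) -> float:
    """Degree to which the organism realizes process x.

    `proteins` is a list of (m, w, h) triples; h > 0 activates, h < 0 inhibits.
    """
    # Lukasiewicz union (bounded sum) of the activating proteins.
    activation = min(1.0, sum(triangle(x, m, w, h)
                              for m, w, h in proteins if h > 0))
    # Lukasiewicz union of the inhibiting proteins.
    inhibition = min(1.0, sum(triangle(x, m, w, h)
                              for m, w, h in proteins if h < 0))
    # Lukasiewicz intersection of the activation with the complement
    # of the inhibition: max(0, a + (1 - b) - 1) = max(0, a - b).
    return max(0.0, activation - inhibition)
```

With one activating protein (0.5, 0.2, 0.8) and one inhibiting protein (0.5, 0.1, −0.3), the phenotype at x = 0.5 is 0.8 − 0.3 = 0.5, while at x = 0.6 the inhibitor no longer reaches and only the activator contributes (about 0.4).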
The phenotypic fuzzy set P indicates to what extent the individual can realize each metabolic process in our abstract metabolic space.

Environment, adaptation and selection

In Aevol, the environment is represented by a phenotypic target: the fuzzy set E defined on Ω that represents the optimal degree of possibility for each “biological function”. To evaluate an individual, we compare its phenotype P to the optimal phenotype E. The “metabolic error” g is computed as the geometric area between these two sets (figure 3). The lower the metabolic error, the better the individual. This measure penalizes both the under-realization and the over-realization of each function.

Figure 3: Measure of individual adaptation. Dashed curve: environmental target E. Solid curve: phenotypic distribution P (resulting metabolic profile obtained after combining all the proteins). Dark grey filled area: metabolic error g.

In the current version of Aevol, the population size is constant (here N = 1,000 individuals) and the population is entirely renewed at each generation. A probability of reproduction is assigned to each individual according to its metabolic error and a multinomial drawing determines the actual number of offspring each individual will have. In the experiments presented here, we used an exponential ranking selection (Blickle and Thiele, 1996). The individuals are sorted by decreasing metabolic error so that the worst individual has rank r = 1 and the best r = N. The probability of reproduction of an individual of rank r is then given by $\frac{s - 1}{s^{N} - 1}\, s^{N - r}$, with s = 0.998 being the intensity of selection in all the experiments presented here.

Genetic operators

During their replication, genomes can undergo different modifications: local mutations (point mutations, insertions or deletions of 1 to 6 bp) and chromosomal rearrangements (duplications, deletions, translocations, inversions). Mutations and rearrangements affect the genome but do not necessarily have a phenotypic effect. For instance, a mutation that takes place in an untranscribed region will be completely neutral unless it creates a new promoter, which is reasonably rare given the size of the consensus sequence.

The rates at which each type of local mutation occurs are parameters of the model. They are defined as the per-base, per-replication probability of each type of mutation to take place. The chromosomal rearrangement rates, however, cannot be direct parameters of the model. Indeed, in this version of the model, the more similar the sequences at the breakpoints, the more likely a rearrangement is to occur. The probability of a chromosomal rearrangement hence depends on the sequence itself and, consequently, is subject to evolution. Details about how we modeled these homology-driven chromosomal rearrangements are provided in the next section. Genetic exchange (crossover) between individuals was not allowed in the simulations presented here, because we first needed to assess the impact of similarity-based intrachromosomal rearrangements in the simple case of an asexual population. We plan to allow for similarity-based genetic exchange in future experiments.

Homology-driven chromosomal rearrangements

Taking homologies into account in the chromosomal rearrangement process requires some knowledge regarding sequence repeats on the chromosome. A naive approach would be to compute a complete alignment search of the genome on itself and then to proceed to the rearrangements, if any. However, searching for alignments between sequences is known to be a computationally costly problem. In our particular case, where we deal with millions of genomes (classically 1,000 genomes per generation for thousands of generations), even a heuristic search such as BLAST (Altschul et al., 1990) would be prohibitively long to compute.
Another possible approach, chosen here, is to use intermittent searches (Bénichou et al., 2005), which provide us with a partial yet sufficient knowledge of sequence alignments within the genome.

In bacteria, several mechanisms can result in a rearranged chromosome. All these mechanisms have a basic prerequisite of spatial proximity: two sequences must be physically close together in the cytoplasm, at least at the breakpoints, for them to rearrange. As the chromosome is supercoiled, two sequences that are very distant from each other on the chromosome can very well be next to each other in the three-dimensional conformation. Since the mechanisms that constrain the spatial conformation of the genome according to its sequence are still poorly understood in bacteria, here we simply pick random pairs of sequences on the genome and consider them to be neighbours.

How many pairs of points are to be drawn depends on both the genome length and its degree of supercoiling. Consider any given sequence on the genome. The number of other sequences that are localized in its surroundings depends on how densely packed the genome is. In a highly supercoiled genome, for instance, all the sequences are very tightly packed together so any sequence has many neighbours and thus many rearrangement opportunities. We thus introduced a specific parameter in the model, the “neighbourhood rate” (µn), that expresses this degree of supercoiling. The number of pairs of points to consider for a possible rearrangement will then be given by L × µn, with L the genome length in bp. Here, µn is a parameter defined for the whole population and cannot change during its evolution.

[Figure 4 panels: (a) Global View (whole genome, candidate pair of points, local alignment search zones); (b) Zoom on Local Search (Sequence 1, Sequence 2, max_shift, “Good” alignment found); (c) Alignment probabilities (x-axis: Alignment Score, y-axis: Probability).]

Figure 4: (a) For each pair of points that are candidates for a rearrangement, a local alignment search is performed between the surrounding sequences, either in direct or indirect sense. (b) The searching zone is defined by 2 parameters: the half length of the searching zone and the maximum slippage max_shift authorized between the sequences. In the experiments presented in this paper, we used values of 50 for half length and 20 for max_shift. (c) Solid line: probability to find a sequence of the given score on a random sequence. Dashed line: the function $p_{rear}(score)$ used to map scores to rearrangement probabilities in our experiments.

For each candidate pair of points, a basic local alignment search will be performed to determine the existence of similarities between the surrounding sequences, either in a direct or indirect sense (figure 4(a)). To that end, we defined a simple scoring function (+1 per match, -2 per mismatch) that allows us to quantify the similarity of two sequences (footnote 1), and associated each score to a probability of rearrangement. The kind and number of rearrangements are computed thanks to algorithm 2.

Preliminary experiments allowed us to adjust the function $p_{rear}(score)$, which maps alignment scores to probabilities of rearrangement. To favour homologous over nonhomologous rearrangements, alignment scores that are seldom found on random sequences (high scores) are associated with very high rearrangement probabilities (homologous rearrangements). Low-score alignments, on the other hand, are likely to result from contingency, and will hence be given low probabilities of rearrangement (nonhomologous rearrangements). Figure 4(c) shows the probability of finding an alignment of a given score on a random sequence as well as the function $p_{rear}(score)$ we used in the following experiments.
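The local alignment search around a candidate pair of points can be sketched as below. This is a minimal Python illustration, not the Aevol code: function names are hypothetical, only the direct sense is handled (the indirect sense used for inversions is left out), alignments are gapless, and the way the best local score is tracked (a running score reset to zero when it becomes negative) is an assumption about how the +1/−2 scoring is applied. The defaults 50 and 20 are the half length and max_shift values quoted in the Figure 4 caption.

```python
def best_local_score(a: str, b: str, match: int = 1, mismatch: int = -2) -> int:
    """Best score of a gapless local alignment between aligned strings a and b."""
    best = run = 0
    for x, y in zip(a, b):
        # The running score never drops below zero, so a new local
        # alignment can start right after a bad stretch.
        run = max(0, run + (match if x == y else mismatch))
        best = max(best, run)
    return best


def window(genome: str, center: int, half_length: int) -> str:
    """Bases in [center - half_length, center + half_length) on a circular genome."""
    n = len(genome)
    return "".join(genome[(center + k) % n]
                   for k in range(-half_length, half_length))


def search_alignment(genome: str, pos1: int, pos2: int, min_score: int,
                     half_length: int = 50, max_shift: int = 20) -> bool:
    """True if some shift yields a direct-sense alignment scoring >= min_score."""
    reference = window(genome, pos1, half_length)
    for shift in range(-max_shift, max_shift + 1):
        candidate = window(genome, pos2 + shift, half_length)
        if best_local_score(reference, candidate) >= min_score:
            return True
    return False
```

For example, `best_local_score("00100", "00000")` is 2: the single mismatch cancels the first two matches, and the last two matches form the best local run.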
This particular function yields a reasonable tradeoff between homologous and nonhomologous rearrangements.

Footnote 1: Even though it is possible to allow for gaps within alignments, the computation cost would be too high. Hence, in the experiments presented here, no gaps were allowed.

Algorithm 2: Aevol Rearrangement Process Algorithm

    initial_nb_pairs ← L ∗ µn
    nb_pairs ← initial_nb_pairs
    while nb_pairs > 0 do
        Draw 2 random positions pos1 and pos2
        Draw type of rearrangement
        if Inversion then sense ← indirect
        else sense ← direct
        Draw minimal alignment score using p_rear^-1
        Search Alignment(pos1, pos2, sense, min_score)
        if Alignment found then
            Proceed to Rearrangement
            Update L
        end
        nb_pairs ← nb_pairs − 1
        nb_pairs ← (nb_pairs / initial_nb_pairs) ∗ L ∗ µn
    end

Results

Our model being quite complex, our experimental methods are very similar to those used in “wet” experimental evolution. We let 60 populations of 1,000 asexual individuals evolve for 20,000 generations in near-identical conditions where the only changing parameters were the mutation rate (one common rate µm for the three different types of local mutations; 4 values ranging from $5 \times 10^{-6}$ to $1 \times 10^{-4}$ were tested) and the neighbourhood rate (µn, 4 values ranging from $1 \times 10^{-2}$ to $5 \times 10^{-1}$). During the evolutionary process, the organisms progressively acquire new genes by duplication and modify them in such a way that the whole gene repertoire fulfills the task the organisms are selected for.

All the simulations proceed qualitatively in a similar way, evolving quickly in the first stage of evolution (rapid gene acquisition, mostly by duplication-divergence), then slowing down the process of gene acquisition while optimizing the sequence of existing genes and promoters. In the experiments presented here, the rate at which rearrangements occur is not constant: it depends on both the neighbourhood rate µn and the presence of repeated sequences on the chromosome. It is hence free to evolve and could well be selected for or against. Yet, despite this added degree of freedom, the rearrangement rate remains a very strong determinant of genome size and content (figure 5). These results confirm those obtained with previous versions of the model in which the rearrangement rates were direct parameters of the model (Knibbe et al., 2007). Even with homologous rearrangements, we find again that the spontaneous rate of rearrangement has a negative impact on fitness (figure 5(d)) because it sets an upper bound on genome size and hence on the number of genes (figure 5(c)). However, rearrangements are also mandatory for evolution to be efficient. An organism whose genome had lost its capacity to rearrange would hardly be evolvable at all.

[Figure 5 panels: (a) Rearrangement Rate (spontaneous rearrangement rate vs. neighbourhood rate); (b) Genome Size, (c) Number of Genes and (d) Metabolic Error, each vs. rearrangement rate. R-squared values of 0.830 and 0.827 appear on two of the panels.]

Figure 5: (a) Average spontaneous rearrangement rates observed for each simulation during the whole evolution. (b,c,d) Genome Size, Number of Genes and Metabolic Error of the best organism after 20,000 generations for each simulation, as a function of the spontaneous rearrangement rate.

Because homologies are created by rearrangements (duplications) and gradually destroyed by local mutations, there must be some sort of complex interaction between the mutation rate, the neighbourhood rate and the rates of both homologous and nonhomologous rearrangements. The distribution of the scores of the alignments that led to rearrangements (figure 6) can help us understand this intricate relationship. If we consider these data vertically, we can clearly observe that the proportion of homologous rearrangements is higher when the neighbourhood rate is high. However, as we progress downwards, the distributions behave differently: while the distribution remains nearly unchanged on the left-hand side, nonhomologous rearrangements become far more frequent on the right. A noteworthy observation is that there is a great variation in the number of rearrangement events. In fact, it is not the number of nonhomologous rearrangements that rises (it actually remains stable), but rather the number of homologous rearrangements that collapses when the neighbourhood rate decreases.

Figure 6: Distribution of the scores of the alignments that caused a rearrangement to occur in the whole population and during the entire evolutionary process, for each value of µn and µm. Light grey: homologous rearrangements, dark grey: nonhomologous rearrangements. For computational performance reasons, the given values are minimal bounds to the corresponding alignment score (cf. Algorithm 2).

The underlying phenomenon is best understood when looking at the data in a top-left to bottom-right fashion. One can then identify a phase transition between a regime of mainly homologous rearrangements at high µn and low µm, and a regime of almost exclusively nonhomologous rearrangements at low µn and high µm. In fact, for the possibility of homologous rearrangements to be maintained along the evolutionary process, homologies must be created (by either homologous or nonhomologous duplications) at least as fast as they are destroyed by local mutations. At high neighbourhood rates, this condition is always achieved because rearrangements are numerous. However, at low neighbourhood rates, the damage caused by local mutations can overcome the creation of homologies and stall the whole process.

The four histograms at the bottom of Figure 6 are hence the most interesting.
Within this row, throughout which $\mu_n = 1 \times 10^{-2}$, the change in rearrangement mode from mainly nonhomologous to mainly homologous is particularly clear when the spontaneous rate of small mutations decreases.

To better understand the dynamics of homologous/nonhomologous rearrangements, we further analysed the simulations from the left-hand side, which display both the greatest proportion of homologous rearrangements (within the bottom row) and, interestingly, the best final fitness of all parameter sets. For the three runs of this parameter set ($\mu_n = 1 \times 10^{-2}$ and $\mu_m = 5 \times 10^{-6}$), we kept track of the family ties during the evolution. We then retrieved the line of ancestry of the final best individual and analyzed the mutational events that occurred on this successful lineage. Except for those that occurred during the very last generations, the events on this lineage are those that went to fixation, either by selection or by genetic drift. In addition, every 10 generations, we used the standard bioinformatic tool Mummer (Kurtz et al., 2004) to find the most significant repeated sequences in the ancestral genome. Mummer uses an approach similar to that of BLAST: it first searches for exact short repeats and then tries to join them together, allowing for gaps and mismatches.

An example of Mummer output is shown in Figure 7. In this example, there are both direct and inverted repeats, and most of the repeated sequences are located in non-coding parts of the genome. This suggests that non-coding DNA plays a major role in genome evolvability by providing breakpoints for chromosomal rearrangements. The emergence of repeated sequences having little or no direct impact on fitness has already been observed in genetic programming (Langdon and Banzhaf, 2008), though in that particular case these repeated sequences could be thought to participate in robustness rather than evolvability.

Figure 8 shows the results of the analysis of the whole lineage of ancestors.
It shows that fitness improvements are strongly correlated with the presence of repeats in the genome and, consequently, with the occurrence of chromosomal rearrangements. The impact of chromosomal rearrangements on evolvability is thus rather complex: on the one hand, a very high rate of spontaneous rearrangements has a negative impact on the final fitness (Figure 5(d)), but on the other hand, in these simulations where the rate was low and the final fitness high, we find that the presence of rearrangements is correlated with fitness improvement (Figure 8). This suggests that a minimal amount of chromosomal rearrangements is required for evolution to be efficient.

[Figure 7 plot: both axes labelled "Best genome", from 0 to 30000; legend: Leading, Lagging.]
Figure 7: Example of Mummer "dot plot" for the best individual at t = 2000 generations, for µn = 10^−2 and µm = 5×10^−6, seed 2. Both the x- and the y-axis represent the genome of this individual. Long and strongly similar sequences appear as runs of diagonal lines across the matrix (exact match length = 15 bp, min. cluster length = 200 bp, max. gap between adjacent matches = 6 bp). Grey areas: coding sequences.

A closer look at the rearrangements that went to fixation in these simulations (see Figure 9) reveals that (i) most of the fixed rearrangements were based on homologous breakpoints (score > 40), (ii) most of the fixed translocations and inversions were neutral, (iii) most of the fixed deletions were beneficial and (iv) most of the fixed duplications were deleterious. This last result is surprising at first sight: one would expect fixed events to be mostly neutral or beneficial. Our hypothesis is that, despite their immediate negative impact, duplications can be indirectly selected because they allow for the creation of new gene copies (which can then undergo small mutations and ultimately realize new functions) and new repeats (which can then mediate other rearrangements).
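The lineage analysis described above (retrieving the line of ancestry of the final best individual, then counting the mutational events along it by windows of 500 generations, as in Figure 8) can be sketched as follows. This is only an illustrative Python sketch, not Aevol's actual code: the `parent_of` mapping, the individual identifiers, the `(generation, event_type)` tuples and both function names are hypothetical, and it assumes clonal reproduction, so that every individual has exactly one parent.

```python
from collections import Counter


def line_of_ancestry(parent_of, final_best):
    """Walk parent pointers from the final best individual back to the founder.

    parent_of: hypothetical dict mapping an individual id to its parent id
               (None for the founder). Returns the lineage, oldest first.
    """
    lineage = [final_best]
    while parent_of.get(lineage[-1]) is not None:
        lineage.append(parent_of[lineage[-1]])
    lineage.reverse()
    return lineage


def events_per_window(events, window=500):
    """Count events of each type in consecutive windows of generations.

    events: iterable of (generation, event_type) pairs recorded on the lineage.
    Returns a dict mapping each window start to a Counter of event types.
    """
    counts = {}
    for generation, kind in events:
        start = (generation // window) * window
        counts.setdefault(start, Counter())[kind] += 1
    return counts


# Hypothetical toy data: "d" is the final best individual.
toy_parents = {"a": None, "b": "a", "c": "a", "d": "b"}
toy_events = [(10, "duplication"), (499, "deletion"), (500, "duplication")]

toy_lineage = line_of_ancestry(toy_parents, "d")
toy_counts = events_per_window(toy_events)
```

Individuals that left no descendants on the final lineage (such as "c" in the toy data) are ignored, which is why the events collected this way are essentially those that went to fixation.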
Conclusion

These experiments of in silico evolution with similarity-based rearrangements confirm our previous results regarding the influence of rearrangements on genome compactness. In large genomes, repeated sequences (located mostly in noncoding regions) promote rearrangements that are, most of the time, deleterious. There is thus an indirect selective pressure to limit the number of rearrangements, which is done by eliminating repeats (fewer homologous rearrangements) and by reducing genome size (fewer nonhomologous rearrangements). However, we have also shown that the absence of rearrangements is correlated with fitness stasis, suggesting that rearrangements can sometimes be directly beneficial or provide an appropriate genetic background for subsequent beneficial mutations. A minimal amount of rearrangements is thus required for evolvability. Here, most of the rearrangements kept by evolution are homologous ones. For them to be possible, repeats must be created at least as fast as they are destroyed by small mutations. In the end, the best conditions for evolvability seem to be a small basal rate of nonhomologous rearrangement combined with a low-enough mutation rate, thus leading to a few stable repeats and to an intermediate degree of variability by homologous rearrangements.

[Figure 8 panels: columns Seed 1, Seed 2, Seed 3; rows: Distance to target (log. scale); Event count (windows of 500 gener.) for point mutations, small insertions, small deletions, duplications, deletions, translocations and inversions; Number of Mummer alignments. x-axis: Generations.]
Figure 8: Analysis of the line of ancestry of the final best individual for µn = 10^−2 and µm = 5×10^−6. First row: evolution of the fitness (the smaller the distance to the target, the higher the probability of reproduction). Second row: evolution of the number of mutational events, by windows of 500 generations. Third row: number of alignments found by Mummer on the genome (parameters: see Figure 7).

[Figure 9 plot: x-axis: Alignment score; y-axis: Impact on metabolic error (negative = beneficial); legend: duplications, deletions, translocations, inversions; annotation: Kendall's tau = −0.0148, p-value = 0.0779.]
Figure 9: Analysis of the fixed rearrangements for µn = 10^−2 and µm = 5×10^−6 (all seeds together). Each point represents a rearrangement that occurred on the line of ancestry of the final best individual.
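The Figure 9 plot carries a Kendall's tau annotation, a rank correlation that presumably relates the alignment score of each fixed rearrangement to its impact on the metabolic error. As a minimal sketch of how such a coefficient can be computed, the Python function below implements the tau-b variant, which corrects for ties. The choice of tau-b, the function name and the sample values are assumptions made for illustration only; the source does not state which variant or software was used, and the p-value computation is not shown.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b rank correlation between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from both denominator terms
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # both variables rank the pair in the same order
        else:
            discordant += 1  # the two variables rank the pair in opposite orders
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical values: alignment scores and impacts on metabolic error.
sample_scores = [12, 25, 41, 58, 73]
sample_impacts = [0.010, -0.004, 0.000, -0.012, 0.003]
sample_tau = kendall_tau_b(sample_scores, sample_impacts)
```

A value close to zero, such as the one annotated on Figure 9, indicates little monotonic association between the two quantities.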
Acknowledgements

We gratefully acknowledge support from the CNRS/IN2P3 Computing Center (Lyon/Villeurbanne - France), for providing a significant amount of the computing resources needed for this work.

References

Adami, C. (2006). Digital genetics: unravelling the genetic basis of evolution. Nat. Rev. Genet., 7(2):109–118.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215:403–410.
Bénichou, O., Coppey, M., Moreau, M., Suet, P. H., and Voituriez, R. (2005). Optimal Search Strategies for Hidden Targets. Physical Review Letters, 94(19):198101+.
Beslon, G., Parsons, D., Sanchez-Dehesa, Y., Pena, J., and Knibbe, C. (2010). Scaling laws in bacterial genomes: A side-effect of selection of mutational robustness. BioSystems, 102(1):32–40.
Blickle, T. and Thiele, L. (1996). A comparison of selection schemes used in evolutionary algorithms. Evol. Comput., 4(4):361–394.
Dittrich, P., Ziegler, J., and Banzhaf, W. (2001). Artificial chemistries-a review. Artif Life, 7(3):225–275.
Earl, D. J. and Deem, M. W. (2004). Evolvability is a selectable trait. Proceedings of the National Academy of Sciences of the United States of America, 101(32):11531–11536.
Kirschner, M. and Gerhart, J. (1998). Evolvability. Proceedings of the National Academy of Sciences, 95(15):8420–8427.
Knibbe, C., Coulon, A., Mazet, O., Fayard, J.-M., and Beslon, G. (2007). A long-term evolutionary pressure on the amount of noncoding DNA. Mol. Biol. Evol., 24(10):2344–2353.
Knibbe, C., Fayard, J.-M., and Beslon, G. (2008). The topology of the protein network influences the dynamics of gene order: from systems biology to a systemic understanding of evolution. Artificial Life, 14(1):149–156.
Kurtz, S., Phillippy, A., Delcher, A., Smoot, M., Shumway, M., Antonescu, C., and Salzberg, S. (2004). Versatile and open software for comparing large genomes. Genome Biology, 5:R12.
Langdon, W. and Banzhaf, W. (2008). Repeated patterns in genetic programming. Natural Computing, 7(4):589–613.
Misevic, D., Ofria, C., and Lenski, R. E. (2006). Sexual reproduction reshapes the genetic architecture of digital organisms. Proc. R. Soc. B., 273(1585):457–464.
Neidhardt, F. C. (1996). Escherichia coli and Salmonella: cellular and molecular biology. ASM Press.
Parsons, D. P., Knibbe, C., and Beslon, G. (2010). Importance of the rearrangement rates on the organization of transcription. In Proceedings of Artificial Life XII, pages 479–486.
Wilke, C. O., Wang, J. L., Ofria, C., Lenski, R. E., and Adami, C. (2001). Evolution of digital organisms at high mutation rates leads to survival of the flattest. Nature, 412(6844):331–333.