Homologous and Nonhomologous Rearrangements:
Interactions and Effects on Evolvability
David P. Parsons1,3, Carole Knibbe2,3 and Guillaume Beslon1,3
1 Université de Lyon, CNRS, INRIA, INSA-Lyon, LIRIS, UMR5205, F-69621, France
2 Université de Lyon, CNRS, INRIA, Université Lyon 1, LIRIS, UMR5205, F-69622, France
3 IXXI, Institut Rhône-Alpin des Systèmes Complexes, Lyon, F-69007, France
[email protected]
Abstract
By using Aevol, a simulation framework designed to study
the evolution of genome structure, we investigate the effect
of homologous rearrangements on the course of evolution.
We designed an efficient model of rearrangements based on
an intermittent search algorithm. Then, using experimental in
silico evolution, we explore the effect of rearrangement rates
on the genome structure. We show that the effect of homologous rearrangements is quite complex. At first glance they
appear to be dangerous enough to trigger an indirect selective
pressure leading to short genomes when the rearrangement
rate is high. However, by analyzing the successful lineage
in the best runs, we found that there is a positive correlation
between the number of homologous rearrangements and the
fitness improvement in these lineages. Thus the impact of
homologous rearrangements on evolution is rather complex:
dangerous on the one hand but necessary on the other hand,
to ensure a sufficient level of evolvability to the organisms.
Moreover, our results show that the spontaneous rate of small
mutations influences the relative proportions of homologous
versus nonhomologous rearrangements.
Introduction
Chromosomal rearrangements are known to play a major role in evolution. Their most visible effects are quite
straightforward: duplications and deletions account for numerous gene acquisitions or losses while translocations and
inversions have a direct influence on gene order. However, these direct effects are accompanied by indirect selective pressures. The rates and mechanisms of rearrangements indeed influence the evolvability (Kirschner and Gerhart, 1998) of the lineage and, as stated by Earl and
Deem (2004), evolvability itself can be subject to evolution.
In the long term, more evolvable lineages are more likely
to produce beneficial mutations and hence to overcome lineages with lower evolvability. Similarly, Wilke et al. (2001)
showed a second-order selective pressure on mutational robustness. The selection of a specific level of evolvability or
robustness is said to be indirect because they do not influence the fitness of the organism, but that of its descendants.
Unraveling these second-order pressures is a very challenging matter. Indeed, the underlying processes are complex and act on a very long time scale. It is hence difficult
to tackle such questions either in vivo or in vitro. Comparative genomics approaches are a way to circumvent this difficulty. However, they are based upon the static snapshots
of the contemporary sequences and have to infer their evolutionary past.
Artificial life and in silico simulations are very useful in
such cases, providing us with insights into complex mechanisms and shedding light onto second-order pressures that
would have been difficult to identify otherwise (Wilke et al.,
2001; Adami, 2006; Misevic et al., 2006; Knibbe et al.,
2007; Beslon et al., 2010). They offer a dynamic view of the
evolutionary process and provide the experimentalist with a
very good control over parameters as well as a perfect fossil
record throughout the evolution.
The Aevol model was developed specifically to study
the evolution of genome structure. Experiments using this
model underlined the major importance of chromosomal rearrangements in the evolutionary process. For a start, we observed that in total absence of chromosomal rearrangements,
evolution can hardly occur at all because gene duplications
are necessary to acquire new genes and thus new functions.
Secondly, it has been shown that, because of rearrangements, non-coding sequences can become mutagenic for the
surrounding genes. The consequence is a clear trend for organisms having evolved under high rearrangement rates to
own shorter and denser genomes than those having evolved
under lower rates of rearrangement (Knibbe et al., 2007; Parsons et al., 2010). As we have already shown, this effect is
the consequence of the long-term selection of a specific level
of mutational variability (Knibbe et al., 2007).
Unlike point mutations and indels that produce local variations, chromosomal rearrangements can involve huge sequences and turn a very fit individual into an ill-adapted one
in a single event. Chromosomal rearrangements can hence
be very dangerous. However, rearrangements are usually
not fully random. Most rearrangements are the consequence
of error-repair mechanisms such as the RecA mediated double strand break repair mechanism (Neidhardt, 1996). These
mechanisms usually require that the sequences be similar (at
least around the breakpoints) to be rearranged. Such rearrangements based on sequence similarity are called homologous rearrangements. By contrast, we here call nonhomologous rearrangements those that occur between sequences of
low similarity. It is tempting to think that, because they are
partially directed, homologous rearrangements could be less
dangerous than rearrangements occurring at random points.
To investigate the role of homologous rearrangements in
genome evolution, we modified the Aevol model to introduce a sensitivity to sequence similarity in the rearrangement process: a rearrangement is now more likely to occur
between similar sequences (homologous recombination) but
remains possible, although at a low probability, when the
breakpoints differ (nonhomologous recombination).
After an overall presentation of the Aevol model, focusing particularly on the way we take sequence homologies
into account in the rearrangement process, we will present
our results regarding the different effects of homologous and
nonhomologous rearrangements. We will discuss the intricate relationship that exists between homologous and nonhomologous rearrangements, and their impact on evolvability.
Aevol: A digital genetics model
The Aevol model was developed in our team to study the
evolution of genome structure. It simulates the evolution of
a population of N artificial haploid organisms with flexible
genomes. Although a description of the model has already
been published (see Knibbe et al. (2008) and its supp. mat.),
we provide hereafter an overview of the most important
principles that are necessary to have a good understanding
of the results presented here.
In Aevol, each artificial organism owns a genome whose
structure is inspired by prokaryotic genomes. It is organized
as a circular double-strand binary string containing a variable number of genes separated by non-coding sequences
(figure 1). Genes are identified and decoded thanks to an
explicit transcription-translation process based upon predefined signaling sequences. Then, an abstract “folding” process gives rise to artificial “proteins” that are able to realize or deflect a particular range of abstract “biological functions”. The interaction of all these proteins yields the set of
functions the organism is able to perform, which will in turn
be compared to an environmental target to determine how
well-adapted this individual is.
At each generation, N new individuals are created by reproducing preferentially the best individuals of the parental
generation which is then completely replaced. During the
replication process, the chromosome can undergo different
kinds of modifications: local mutations (point mutations,
small insertions and small deletions), but also large chromosomal rearrangements (duplications, deletions, translocations and inversions). At the beginning of the run, all the organisms are initialized with the same random sequence (of
5,000 base-pairs here) which contains at least one gene.
Figure 1: In Aevol, each individual owns a circular double-stranded binary genome upon which coding sequences are identified thanks to predefined signalling sequences: promoters and terminators mark the boundaries of transcribed sequences and, inside these transcribed regions, coding sequences can exist between a START signal and an in-frame STOP codon (see figure 2 for the genetic code). [Figure labels: double-stranded genome with scattered genes; Promoter, Shine-Dalgarno, START, Coding DNA Sequence, STOP, Terminator.]
From genotype to phenotype
Transcription In prokaryotes, transcription initiates at particular sites, called promoters, where RNA polymerases recognize a consensus sequence to which they can bind and begin RNA synthesis. In Aevol, we defined a long consensus sequence, a promoter being a sequence whose Hamming distance d from this consensus is less than or equal to $d_{max}$. In the experiments presented here, the consensus was a 22-base-pair (bp) sequence and up to $d_{max} = 4$ mismatches were allowed. This consensus sequence is long enough to ensure that random, non-coding sequences have a low probability of becoming coding through a single mutation event.
When a promoter is found, the transcription goes on until a terminator is reached. We defined terminators as sequences that would be able to form a stem-loop structure, as the ρ-independent bacterial terminators do. In these experiments, the stem size was set to 4 and the loop size to 3; terminators thus had the following structure: abcd***dcba, where a, b, c, d = 0 or 1.
The expression level e of an RNA is determined according to its promoter sequence: $e = 1 - \frac{d}{d_{max} + 1}$. This modulation
of the expression level models in a simplified way the basal
interaction of the RNA polymerase with the promoter, without additional regulation. It provides duplicated genes with
a way to reduce temporarily their phenotypic contribution
while diverging toward other functions.
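As an illustration only, promoter detection and the resulting expression level could be sketched in Python as follows. The consensus string, the function names and the single-strand scan are assumptions made for this example; the text does not give the actual 22-bp consensus used in Aevol.

```python
# HYPOTHETICAL 22-bp consensus: the actual Aevol consensus is not given in the text.
CONSENSUS = "0101011001110010010110"
D_MAX = 4  # maximum number of mismatches allowed in the experiments


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def expression_level(d: int, d_max: int = D_MAX) -> float:
    """e = 1 - d / (d_max + 1): 1.0 for a perfect promoter, lower otherwise."""
    return 1.0 - d / (d_max + 1)


def find_promoters(genome: str, consensus: str = CONSENSUS, d_max: int = D_MAX):
    """Yield (position, expression level) for each promoter found on one
    strand of a circular genome."""
    n, k = len(genome), len(consensus)
    if n < k:
        return
    circular = genome + genome[: k - 1]  # let windows wrap around the origin
    for pos in range(n):
        d = hamming(circular[pos : pos + k], consensus)
        if d <= d_max:
            yield pos, expression_level(d, d_max)
```

With $d_{max} = 4$, a promoter with two mismatches is expressed at 1 − 2/5 = 0.6 of the maximal level, and even the weakest accepted promoter (four mismatches) keeps a non-zero level of 0.2.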
Translation Transcribed sequences (RNAs) do not necessarily result in a protein. The translation process of an RNA takes place when a Shine-Dalgarno-like sequence is found, followed, a few base-pairs away, by a START codon (see the genetic code in figure 2). Whenever this signal 011011****000 is found, the following sequence is read three bases (one codon) at a time until the STOP codon (001) is found in the same reading frame. Each codon lying between the initiation and termination signals is translated into an abstract “Amino-Acid” using an artificial genetic code, therefore giving rise to the protein’s primary sequence (figure 2). As in real organisms, genes can be found on six different reading frames (three on each strand), giving the possibility for the organisms to evolve overlapping genes, which are commonly found in viruses and bacteria.

foreach Generation do
    // Evaluation
    foreach Individual do
        Identify coding sequences
        foreach CodingSequence do
            Translate into abstract protein
        end
        Compute phenotype by combining protein contributions
        Compute fitness by comparing the phenotype to the environmental target
    end
    // Selection
    Sort the individuals by fitness
    Compute the probabilities of reproduction
    Draw the actual numbers of offspring
    // Reproduction
    foreach Individual do
        foreach Offspring do
            Do Rearrangements
            Do Local Mutations
        end
    end
    Replace current population
end
Algorithm 1: Aevol General Algorithm
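The selection step of Algorithm 1 can be sketched in Python under the exponential ranking scheme used in these experiments (intensity s = 0.998, worst individual at rank 1, best at rank N). This is a minimal sketch, not the Aevol source: the function names are ours, and the multinomial draw is realized with N weighted draws from the standard library.

```python
import random
from collections import Counter


def reproduction_probabilities(n: int, s: float = 0.998) -> list:
    """Exponential ranking: probabilities for ranks r = 1 (worst) .. n (best)."""
    norm = (s - 1.0) / (s ** n - 1.0)
    return [norm * s ** (n - r) for r in range(1, n + 1)]


def draw_offspring(metabolic_errors, s: float = 0.998, rng=None) -> list:
    """Return the number of offspring of each individual."""
    rng = rng or random.Random()
    n = len(metabolic_errors)
    # Sort by decreasing metabolic error: the worst individual gets rank 1.
    order = sorted(range(n), key=lambda i: metabolic_errors[i], reverse=True)
    probs = reproduction_probabilities(n, s)
    # n independent weighted draws amount to one multinomial draw.
    parents = rng.choices(order, weights=probs, k=n)
    counts = Counter(parents)
    return [counts[i] for i in range(n)]
```

Because the probabilities depend only on ranks, the selection pressure does not change when all metabolic errors are rescaled, which is one usual reason for preferring ranking schemes over fitness-proportionate ones.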
Protein “folding” and phenotype computation To
model the activity of proteins and the resulting phenotype,
we defined a simple “artificial chemistry” (Dittrich et al.,
2001) that describes the organism’s metabolism in a mathematical language. In our simplified artificial world, we assume that there is an abstract, one-dimensional space Ω =
[0, 1] representing all the possible metabolic processes (that
is, in this model, a metabolic process is just a real number).
In this “metabolic space”, each protein is involved in a subset of processes (either realizing it or preventing other proteins from realizing it) which is described using the fuzzy set
formalism: a given protein can be involved in a metabolic
process with a possibility degree lying between 0 and 1. A
protein is thus fully characterized by a mathematical function that associates a possibility degree to each metabolic
process, describing the fuzzy subset of metabolic processes
it is involved in. For simplicity, we use piecewise-linear
functions with a symmetric, triangular shape (figure 2). In this way, only three numbers are needed to characterize the metabolic activity of a protein: the position m (m ∈ Ω) of the triangle on the axis, its half-width w and its height h (positive when realizing a function, negative when inhibiting it). This means that the protein contributes to the range [m−w, m+w] of metabolic processes, with a preference for the processes closest to m (for which the highest efficiency, h, is reached). Thus, various types of proteins can co-exist, from highly efficient and specialized ones (small w, high h) to polyvalent but poorly efficient ones (large w, low h).
Figure 2: Overview of the transcription-translation-folding process in Aevol. Transcribed sequences are those that start with a promoter (consensus sequence) and end with a terminator sequence (stem-loop structure), not shown on the figure. Coding sequences (genes) are searched within the transcribed sequences; they begin with a Shine-Dalgarno-START sequence and end with a STOP codon. An artificial genetic code (right) is used to convert a gene into the primary sequence of the corresponding protein and a “folding process” enables us to compute the metabolic activity of this protein (functional abilities). [Genetic code shown in the figure: 000 = START, 001 = STOP, 100 = M0, 101 = M1, 010 = W0, 011 = W1, 110 = H0, 111 = H1. Example shown in the figure: the sequence 011011 0011 000 101 111 011 101 110 011 010 001 encodes the primary sequence M1-H1-W1-M1-H0-W1-W0, which yields the binary codes M = 11, W = 110 and H = 10; the figure also lists normalized values of 0.66, 0.07 and 0.33 and a possibility degree of e.|h|.]
In this framework, each protein’s primary sequence is decomposed into three interlaced binary subsequences that will
in turn be interpreted as the values for the m, w and h parameters. For instance, the codon 010 (resp. 011) is translated
into the single amino acid W0 (resp. W1), which means
that it contributes to the value of w by adding a bit 0 (resp.
1) to its binary code. Small mutations in the coding sequence
(point mutations, indels, possibly causing frame shifts) will
change these parameters, resulting in a modification of the
protein’s metabolic activity.
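The translation rules and the interlaced encoding described above can be illustrated with a short Python sketch. The genetic code and the test sequence are those of figure 2; the function names, the choice to skip an in-frame 000 inside a gene, and the choice to produce no protein when no in-frame STOP is found are assumptions of this example.

```python
import re

# Genetic code as given in figure 2 (codon -> abstract amino acid).
GENETIC_CODE = {
    "100": "M0", "101": "M1",
    "010": "W0", "011": "W1",
    "110": "H0", "111": "H1",
}
STOP = "001"
# Shine-Dalgarno-like sequence, 4 arbitrary bases, then the START codon 000.
# A lookahead is used so that overlapping signals are all found.
INITIATION = re.compile(r"(?=011011[01]{4}000)")


def translate(rna: str) -> list:
    """Return one list of amino acids per coding sequence found in `rna`."""
    proteins = []
    for match in INITIATION.finditer(rna):
        pos = match.start() + 13  # first codon after the 13-base signal
        protein = []
        while pos + 3 <= len(rna):
            codon = rna[pos : pos + 3]
            if codon == STOP:
                proteins.append(protein)
                break
            # ASSUMPTION: an in-frame 000 inside a gene is simply skipped.
            if codon in GENETIC_CODE:
                protein.append(GENETIC_CODE[codon])
            pos += 3
        # ASSUMPTION: without an in-frame STOP, no protein is produced.
    return proteins


def parameter_bits(protein: list) -> dict:
    """Split a primary sequence into the binary codes of m, w and h."""
    bits = {"M": "", "W": "", "H": ""}
    for amino_acid in protein:
        bits[amino_acid[0]] += amino_acid[1]
    return bits


rna = "0110110011000101111011101110011010001"  # example sequence of figure 2
protein = translate(rna)[0]
codes = parameter_bits(protein)
```

On the sequence of figure 2, `protein` is M1-H1-W1-M1-H0-W1-W0 and `codes` holds the binary codes 11, 110 and 10 for M, W and H. The conversion of these codes into the real-valued parameters m, w and h is deliberately left out, since the text does not specify it.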
Once all the proteins encoded on the genotype of the organism have been identified and characterized, their activities are combined into a fuzzy set representing the individual’s phenotype $P = (\cup_i A_i) \cap \overline{(\cup_j I_j)}$, using Łukasiewicz fuzzy operators, with $A_i$ being the fuzzy subset of the i-th activating protein ($h_i > 0$), $I_j$ the fuzzy subset of the j-th inhibiting protein ($h_j < 0$) and the overline denoting the fuzzy complement. Intuitively, this means that
metabolic processes achieved by the organism are those that
are activated and not inhibited. The phenotypic fuzzy set
P indicates to what extent the individual can realize each
metabolic process in our abstract metabolic space.
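A minimal Python sketch of this combination is given below. It assumes the standard Łukasiewicz operators (union min(1, a + b), intersection max(0, a + b − 1), complement 1 − a) and assumes that the height h of each triangle has already been scaled by the expression level of its gene; the function names are ours.

```python
def triangle(m: float, w: float, h: float):
    """Possibility degree contributed by one protein: a triangle of
    half-width w and height h centred on m (h < 0 for an inhibitor)."""
    def degree(x: float) -> float:
        if w <= 0:
            return 0.0
        return h * max(0.0, 1.0 - abs(x - m) / w)
    return degree


def phenotype(proteins, x: float) -> float:
    """P(x): activated and not inhibited, with Lukasiewicz operators."""
    # Bounded sums implement the n-ary Lukasiewicz union.
    activation = min(1.0, sum(p(x) for p in proteins if p(x) > 0))
    inhibition = min(1.0, sum(-p(x) for p in proteins if p(x) < 0))
    # Intersection of the activation with the complement of the inhibition.
    return max(0.0, activation + (1.0 - inhibition) - 1.0)


proteins = [triangle(0.5, 0.1, 0.8), triangle(0.5, 0.1, -0.3)]
value = phenotype(proteins, 0.5)  # 0.8 activated, 0.3 inhibited
```

With these operators the phenotype reduces to the bounded difference max(0, activation − inhibition), so an inhibitor can cancel an activator but never drives the possibility degree below zero.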
Environment, adaptation and selection
In Aevol, the environment is represented by a phenotypic
target: the fuzzy set E defined on Ω that represents the optimal degree of possibility for each “biological function”.
To evaluate an individual, we compare its phenotype P to
the optimal phenotype E. The “metabolic error” g is computed as the geometric area between these two sets (figure
3). The lower the metabolic error, the better the individual.
This measure penalizes both the under-realization and the
over-realization of each function.
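The metabolic error can be approximated numerically as the area between the two curves. The sketch below uses the trapezoidal rule on a regular grid over Ω = [0, 1]; the grid size and the example target are arbitrary choices of this illustration, not values from the model.

```python
def metabolic_error(phenotype, target, n_points: int = 1000) -> float:
    """Area between two functions defined on [0, 1] (trapezoidal rule)."""
    step = 1.0 / n_points
    gaps = [abs(phenotype(i * step) - target(i * step)) for i in range(n_points + 1)]
    return step * (sum(gaps) - 0.5 * (gaps[0] + gaps[-1]))


def example_target(x: float) -> float:
    """HYPOTHETICAL environmental target: one triangle centred on 0.5."""
    return max(0.0, 1.0 - abs(x - 0.5) / 0.25)


# An organism with no functional protein realizes nothing at all.
g_empty = metabolic_error(lambda x: 0.0, example_target)
```

Taking the absolute difference is what penalizes over-realization as much as under-realization: a phenotype lying above the target contributes to g exactly like one lying below it.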
Figure 3: Measure of individual adaptation. Dashed curve: environmental target E. Solid curve: phenotypic distribution P (resulting metabolic profile obtained after combining all the proteins). Dark grey filled area: metabolic error g.
In the current version of Aevol, the population size is constant (here N = 1,000 individuals) and the population is entirely renewed at each generation. A probability of reproduction is assigned to each individual according to its metabolic error and a multinomial drawing determines the actual number of offspring each individual will have. In the experiments presented here, we used an exponential ranking selection (Blickle and Thiele, 1996). The individuals are sorted by decreasing metabolic error so that the worst individual has rank r = 1 and the best r = N. The probability of reproduction of an individual is then given by $\frac{s - 1}{s^{N} - 1} s^{N - r}$, with s = 0.998 being the intensity of selection in all the experiments presented here.
Genetic operators
During their replication, genomes can undergo different modifications: local mutations (point mutations, insertions or deletions of 1 to 6 bp) and chromosomal rearrangements (duplications, deletions, translocations, inversions).
Mutations and rearrangements affect the genome but do not necessarily have a phenotypic effect. For instance, a mutation that takes place in an untranscribed region will be completely neutral unless it creates a new promoter, which is reasonably rare given the size of the consensus sequence.
The rates at which each type of local mutation occurs are parameters of the model. They are defined as the per-base, per-replication probability of each type of mutation to take place. The chromosomal rearrangement rates, however, cannot be direct parameters of the model. Indeed, in this version of the model, a rearrangement is all the more likely to occur as the sequences at the breakpoints are similar. The probability of a chromosomal rearrangement to occur hence depends on the sequence itself and, consequently, is subject to evolution. Details about how we modeled these homology-driven chromosomal rearrangements are provided in the next section.
Genetic exchange (crossover) between individuals was not allowed in the simulations presented here, because we first needed to assess the impact of similarity-based intrachromosomal rearrangements in the simple case of an asexual population. We plan to allow for similarity-based genetic exchange in future experiments.
Homology-driven chromosomal rearrangements
Taking homologies into account in the chromosomal rearrangement process requires some knowledge regarding sequence repeats on the chromosome. A naive approach would
be to compute a complete alignment search of the genome
on itself and then to proceed to the rearrangements if any.
However, searching for alignments between sequences is
known to be a computationally costly problem. In our particular case, where we deal with millions of genomes (classically 1,000 genomes per generation for thousands of generations), even a heuristic search such as BLAST (Altschul
et al., 1990) would be forbiddingly long to compute. Another possible approach, chosen here, is to use intermittent
searches (Bénichou et al., 2005), that provide us with a partial yet sufficient knowledge of sequence alignments within
the genome.
In bacteria, several mechanisms can result in a rearranged
chromosome. All these mechanisms have a basic prerequisite of spatial proximity: two sequences must be physically
close together in the cytoplasm, at least at the breakpoints,
for them to rearrange. As the chromosome is supercoiled,
two sequences that are very distant from each other on the
chromosome can very well be next to each other in the three-dimensional conformation. Since the mechanisms that constrain the spatial conformation of the genome according to its sequence are still poorly understood in bacteria, here we simply pick random pairs of sequences on the genome and consider them to be neighbours.
How many pairs of points are to be drawn depends on
both the genome length and its degree of supercoiling. Consider any given sequence on the genome. The number of
other sequences that are localized in its surroundings depends on how densely packed the genome is. In a highly
supercoiled genome, for instance, all the sequences are very tightly packed together so any sequence has many neighbours and thus many rearrangement opportunities. We thus introduced a specific parameter in the model, the “neighbourhood rate” (µn), that expresses this degree of supercoiling. The number of pairs of points to consider for a possible rearrangement is then given by L × µn, with L the genome length in bp. Here, µn is a parameter defined for the whole population and cannot change during its evolution.
Figure 4: (a) Global view: for each pair of points that are candidates for a rearrangement, a local alignment search is performed between the surrounding sequences, either in the direct or in the indirect sense. (b) Zoom on local search: the search zone is defined by 2 parameters, the half length of the search zone (half_length) and the maximum slippage authorized between the sequences (max_shift). In the experiments presented in this paper, we used values of 50 for half_length and 20 for max_shift. (c) Alignment probabilities as a function of the alignment score (0 to 100). Solid line: probability of finding an alignment of the given score on a random sequence. Dashed line: the function $p_{rear}(score)$ used to map scores to rearrangement probabilities in our experiments.
For each candidate pair of points, a basic local alignment
search will be performed to determine the existence of similarities between the surrounding sequences either in a direct
or indirect sense (figure 4(a)). To that end, we defined a simple scoring function (+1 per match, -2 per mismatch) that
allows us to quantify the similarity of two sequences¹, and
associated each score to a probability of rearrangement. The
kind and number of rearrangements are computed thanks to
algorithm 2.
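As a rough illustration of the scoring function, the following Python sketch computes the best ungapped local score (+1 per match, −2 per mismatch) between the neighbourhoods of two candidate points, in the direct sense only. How Aevol actually explores the search zone is not detailed here, so the window construction and the loop over shifts are assumptions of this example; the defaults 50 and 20 are the values quoted in the caption of figure 4.

```python
MATCH, MISMATCH = 1, -2  # scoring function given in the text


def best_ungapped_score(a: str, b: str) -> int:
    """Highest score of any contiguous run when a and b are compared
    position by position (maximum-subarray scan, no gaps)."""
    best = current = 0
    for x, y in zip(a, b):
        current = max(0, current + (MATCH if x == y else MISMATCH))
        best = max(best, current)
    return best


def window(genome: str, centre: int, half_length: int) -> str:
    """Circular slice of length 2 * half_length centred on `centre`."""
    n = len(genome)
    return "".join(genome[(centre + k) % n] for k in range(-half_length, half_length))


def search_alignment(genome: str, pos1: int, pos2: int,
                     half_length: int = 50, max_shift: int = 20) -> int:
    """Best direct-sense score around (pos1, pos2), trying every shift of
    the second window within +/- max_shift (ASSUMED search strategy)."""
    ref = window(genome, pos1, half_length)
    return max(
        best_ungapped_score(ref, window(genome, pos2 + shift, half_length))
        for shift in range(-max_shift, max_shift + 1)
    )
```

An indirect-sense search would compare the first window with the reverse complement of the second one. With a mismatch penalty of −2, a single mismatch cancels two matches, so high scores require long, nearly exact repeats.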
Preliminary experiments allowed us to adjust the function $p_{rear}(score)$, which maps alignment scores to probabilities of rearrangement. To favour homologous over nonhomologous rearrangements, alignment scores that are seldom found on random sequences (high scores) are associated with very high rearrangement probabilities (homologous rearrangements). Low-score alignments, on the other hand, are likely to result from contingency, and will hence be given low probabilities of rearrangement (nonhomologous rearrangements). Figure 4(c) shows the probability of finding an alignment of a given score on a random sequence as well as the function $p_{rear}(score)$ we used in the
following experiments. This particular function yields a reasonable tradeoff between homologous and nonhomologous rearrangements.
¹ Even though it is possible to allow for gaps within alignments, the computation cost would be too high. Hence, in the experiments presented here, no gaps were allowed.
initial_nb_pairs ← L ∗ µn
nb_pairs ← initial_nb_pairs
while nb_pairs > 0 do
    Draw 2 random positions pos1 and pos2
    Draw type of rearrangement
    if Inversion then sense ← indirect
    else sense ← direct
    Draw minimal alignment score min_score using $p_{rear}^{-1}$
    Search Alignment(pos1, pos2, sense, min_score)
    if Alignment found then
        Proceed to Rearrangement
        Update L
    end
    nb_pairs ← nb_pairs − 1
    nb_pairs ← (nb_pairs / initial_nb_pairs) ∗ L ∗ µn
end
Algorithm 2: Aevol Rearrangement Process Algorithm
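Two steps of Algorithm 2 lend themselves to a short Python sketch: the number of candidate pairs, and the draw of a minimal alignment score through the inverse of $p_{rear}$. The sigmoid used for `p_rear` below is purely hypothetical (the paper only shows the curve in figure 4(c)), as are the function names and the rounding of L × µn to an integer.

```python
import math
import random


def nb_candidate_pairs(genome_length: int, mu_n: float) -> int:
    """Number of pairs of points to test, L * mu_n (rounding is our choice)."""
    return round(genome_length * mu_n)


def p_rear(score: int) -> float:
    """HYPOTHETICAL increasing map from alignment score to probability."""
    return 1.0 / (1.0 + math.exp(-(score - 40) / 4.0))


def draw_min_score(rng: random.Random, max_score: int = 100) -> int:
    """Smallest score whose probability reaches a uniform draw u,
    i.e. the inverse of p_rear applied to u."""
    u = rng.random()
    for score in range(max_score + 1):
        if p_rear(score) >= u:
            return score
    return max_score + 1  # no reachable score is good enough


def rearrangement_accepted(found_score: int, min_score: int) -> bool:
    """An alignment triggers a rearrangement if it reaches the threshold."""
    return found_score >= min_score
```

Since `p_rear` is increasing, an alignment of a given score passes the drawn threshold with probability `p_rear(score)`, so drawing the threshold first is equivalent to scoring first and then accepting at random. A plausible benefit is that the search can stop as soon as the threshold is reached, which would be consistent with the remark in the caption of figure 6 that recorded scores are minimal bounds.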
Results
Our model being quite complex, our experimental methods
are very similar to those used in “wet” experimental evolution. We let 60 populations of 1,000 asexual individuals
evolve during 20,000 generations in near identical conditions where the only changing parameters were the mutation
rate (one common rate µm for the three different types of local mutations, 4 values ranging from $5 \times 10^{-6}$ to $1 \times 10^{-4}$ were tested) and the neighbourhood rate (µn, 4 values ranging from $1 \times 10^{-2}$ to $5 \times 10^{-1}$). During the evolutionary process,
the organisms progressively acquire new genes by duplication and modify them in such a way that the whole gene
repertoire fulfills the task the organisms are selected for.
All the simulations proceed qualitatively in a similar way, evolving quickly in the first stage of evolution (rapid gene acquisition mostly by duplication-divergence) then slowing down the process of gene acquisition while optimizing the sequence of existing genes and promoters.
In the experiments presented here, the rate at which rearrangements occur is not constant: it depends on both the neighbourhood rate µn and on the presence of repeated sequences on the chromosome. It is hence free to evolve and could well be selected for or against. Yet, despite this added degree of freedom, the rearrangement rate remains a very strong determinant of genome size and content (figure 5). These results confirm those obtained with previous versions of the model in which the rearrangement rates were direct parameters of the model (Knibbe et al., 2007). Even with homologous rearrangements, we find again that the spontaneous rate of rearrangement has a negative impact on fitness (figure 5(d)) because it sets an upper bound on genome size and hence on the number of genes (figure 5(c)). However, rearrangements are also mandatory for evolution to be efficient. An organism whose genome had lost its capacity to rearrange would hardly be evolvable at all.
Figure 5: (a) Average spontaneous rearrangement rates observed for each simulation during the whole evolution. (b,c,d) Genome Size, Genes Number and Metabolic Error of the best organism after 20,000 generations for each simulation, as a function of the spontaneous rearrangement rate. [Panel titles: (a) Rearrangement Rate, (b) Genome Size, (c) Number of Genes, (d) Metabolic Error. Axis labels: Neighbourhood rate, Spontaneous Rearrangement Rate, Rearrangement Rate, Genome Size, Number of genes, Metabolic Error. R-squared values shown on the panels: 0.830 and 0.827.]
Because homologies are created by rearrangements (duplications) and gradually destroyed by local mutations, there must be some sort of complex interactions between the mutation rate, the neighbourhood rate and the rates of both homologous and nonhomologous rearrangements.
The distribution of the scores of the alignments that led to rearrangements (figure 6) can help us understand this intricate relationship. If we consider these data vertically, we can clearly observe that the proportion of homologous rearrangements is higher when the neighbourhood rate is high. However, as we progress downwards, the distributions behave differently: while they remain nearly unchanged on the left-hand side, nonhomologous rearrangements become far more frequent on the right. A noteworthy observation is that there is a great variation in the number of rearrangement events. In fact, it is not the number of nonhomologous rearrangements that rises (it actually remains stable), but rather the number of homologous rearrangements that collapses when the neighbourhood rate decreases.
Figure 6: Distribution of the scores of the alignments that caused a rearrangement to occur in the whole population and during the entire evolutionary process, for each value of µn and µm. Light grey: homologous rearrangements, dark grey: nonhomologous rearrangements. For computational performance reasons, the given values are minimal bounds to the corresponding alignment score (cf. Algorithm 2).
The underlying phenomenon is best understood when
looking at the data in a top-left to bottom-right fashion.
One can then identify a phase transition between a regime
of mainly homologous rearrangements at high µn and low
µm , and a regime of almost exclusively nonhomologous rearrangements at low µn and high µm . In fact, for the possibility of homologous rearrangements to be maintained along
the evolutionary process, homologies must be created (by either homologous or nonhomologous duplications) at least as
fast as they are destroyed by local mutations. At high neighbourhood rates, this condition is always achieved because
rearrangements are numerous. However, at low neighbourhood rates, the damage caused by local mutations can overcome the creation of homologies and stall the whole process.
The four histograms at the bottom of Figure 6 are hence the most interesting. Within this row, throughout which $\mu_n = 1 \times 10^{-2}$, the change in rearrangement mode from mainly nonhomologous to mainly homologous is particularly clear when the spontaneous rate of small mutations decreases. To better understand the dynamics of homologous/nonhomologous rearrangements, we further analysed the simulations from the left-hand side, which display both the greatest proportion of homologous rearrangements (within the bottom row) and, interestingly, the best final fitness of all parameter sets. For the three runs of this parameter set ($\mu_n = 1 \times 10^{-2}$ and $\mu_m = 5 \times 10^{-6}$), we kept track of the family ties during the evolution. We then retrieved the line of ancestry of the final best individual and analyzed the mutational events that occurred on this successful lineage. Except for those that occurred during the very last generations, the events on this lineage are those that went to fixation, either by selection or by genetic drift. In addition, every 10 generations, we used the standard bioinformatic tool Mummer (Kurtz et al., 2004) to find the most significant repeated sequences in the ancestral genome. Mummer uses an approach similar to that of BLAST: it first searches for exact short repeats and then tries to join them together, allowing for gaps and mismatches. An example of Mummer output is shown in Figure 7. In this example, there are both direct and inverted repeats, and most of the repeated sequences are located in non-coding parts of the genome. This suggests that non-coding DNA plays a major role in genome evolvability by providing breakpoints for chromosomal rearrangements.
The emergence of repeated sequences having little or no direct impact on fitness has already been observed in genetic
programming (Langdon and Banzhaf, 2008) though in that
particular case, these repeated sequences could be thought
to participate in robustness rather than evolvability.
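The seed-and-join idea described above (exact short matches first, then joining neighbouring matches across small gaps) can be sketched as follows. This is a minimal illustration, not Mummer's algorithm or code: the function name is hypothetical, only direct repeats are searched (inverted repeats would additionally require matching against the reverse complement), and the default parameters simply mirror the values quoted in the caption of Figure 7, on the assumption that they map onto this simplified scheme.

```python
from collections import defaultdict


def find_direct_repeats(genome, k=15, max_gap=6, min_cluster=200):
    """Return sorted (start_i, start_j, length) tuples for direct repeats, i < j."""
    # 1. Seed: index every exact k-mer by its start positions.
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)

    # 2. Group seed matches by diagonal (offset j - i), keeping only i < j.
    diagonals = defaultdict(list)
    for starts in positions.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                diagonals[starts[b] - starts[a]].append(starts[a])

    # 3. Join: on each diagonal, merge seeds separated by at most max_gap bp,
    #    then keep only the clusters that are long enough.
    repeats = []
    for offset, seeds in diagonals.items():
        seeds.sort()
        run_start, run_end = seeds[0], seeds[0] + k
        for i in seeds[1:]:
            if i - run_end <= max_gap:
                run_end = max(run_end, i + k)
            else:
                if run_end - run_start >= min_cluster:
                    repeats.append((run_start, run_start + offset, run_end - run_start))
                run_start, run_end = i, i + k
        if run_end - run_start >= min_cluster:
            repeats.append((run_start, run_start + offset, run_end - run_start))
    return sorted(repeats)
```

Because seeds are joined along a diagonal whenever the gap between them is small, an isolated mismatch inside a repeat interrupts the exact seeds but not the reported cluster. The pairwise loop over seed occurrences is quadratic in the number of copies of each k-mer, which is acceptable for a sketch but not for highly repetitive genomes.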
Figure 8 shows the results of the analysis of the whole
lineage of ancestors. It shows that fitness improvements
are strongly correlated with the presence of repeats in the
genome and, consequently, with the occurrence of chromosomal rearrangements. The impact of chromosomal rearrangements on evolvability is thus rather complex: on the
one hand, a very high rate of spontaneous rearrangements
has a negative impact on the final fitness (Figure 5(d)), but
on the other hand, in these simulations where the rate was
low and the final fitness high, we find that the presence of
rearrangements is correlated with fitness improvement (Figure 8). This suggests that a minimal amount of chromosomal
rearrangements is required for evolution to be efficient.
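The lineage analysis described above (retrieving the line of ancestry of the final best individual, then counting mutational events in windows of 500 generations) can be sketched as below. The data structures are assumptions made for illustration: `parent_of`, `events_of` and `generation_of` are hypothetical dictionaries standing in for whatever records the simulation keeps of family ties and mutational events.

```python
from collections import Counter


def line_of_ancestry(parent_of, final_best):
    """Walk parent pointers back from the final best individual.

    `parent_of` maps an individual id to its parent id (None for a founder).
    Returns the lineage ordered from the oldest ancestor to `final_best`.
    """
    lineage = [final_best]
    while parent_of.get(lineage[-1]) is not None:
        lineage.append(parent_of[lineage[-1]])
    lineage.reverse()
    return lineage


def events_per_window(lineage, events_of, generation_of, window=500):
    """Count mutational events on the lineage, per window and per event type.

    `events_of` maps an individual id to the list of event types it underwent
    at birth; `generation_of` maps an individual id to its generation number.
    Keys of the result are (window_index, event_type) pairs.
    """
    counts = Counter()
    for individual in lineage:
        window_index = generation_of[individual] // window
        for kind in events_of.get(individual, ()):
            counts[(window_index, kind)] += 1
    return counts
```

Only events carried by individuals on the lineage are counted, so mutations that arose in side branches and were later lost do not contribute.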
Figure 7: Example of Mummer “dot plot” for the best individual at t = 2000 generations, for µn = 10^−2 and
µm = 5 × 10^−6, seed 2. Both the x- and the y-axis represent
the genome of this individual. Long and strongly similar sequences appear as runs of diagonal lines across the matrix
(exact match length = 15 bp, min. cluster length = 200 bp,
max. gap between adjacent matches = 6 bp). Grey areas:
coding sequences. [Plot labels: Best genome (both axes), Leading, Lagging.]

A closer look at the rearrangements that went to fixation
in these simulations (see Figure 9) reveals that (i) most of
the fixed rearrangements were based on homologous breakpoints (score > 40), (ii) most of the fixed translocations and
inversions were neutral, (iii) most of the fixed deletions were
beneficial and (iv) most of the fixed duplications were deleterious. This last result is surprising at first sight: one would
expect fixed events to be mostly neutral or beneficial. Our
hypothesis is that, despite their immediate negative impact,
duplications can be indirectly selected because they allow
for the creation of new gene copies (which can then undergo
small mutations and ultimately realize new functions) and
new repeats (which can then mediate other rearrangements).
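The breakdown of fixed rearrangements given above amounts to a simple tally, sketched here under stated assumptions. The event tuples are a hypothetical format, and the tolerance `eps` used to call an event neutral is an arbitrary illustrative choice. The score threshold of 40 is the one quoted in the text, and a negative impact is counted as beneficial on the assumption that impact is measured as a change in metabolic error.

```python
from collections import Counter


def classify(event, score_threshold=40.0, eps=1e-9):
    """Classify one fixed rearrangement.

    `event` is a (kind, alignment_score, impact) tuple, where `impact` is the
    change in metabolic error caused by the event (negative = beneficial).
    """
    kind, score, impact = event
    mode = "homologous" if score > score_threshold else "nonhomologous"
    if impact < -eps:
        effect = "beneficial"
    elif impact > eps:
        effect = "deleterious"
    else:
        effect = "neutral"
    return kind, mode, effect


def tally(events):
    """Count fixed rearrangements per (kind, mode, effect) combination."""
    return Counter(classify(event) for event in events)
```

Summing the resulting counts over one field at a time gives the per-type proportions (for example, the share of fixed duplications that were deleterious).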
Conclusion
These in silico evolution experiments with similarity-based rearrangements confirm our previous results regarding
the influence of rearrangements on genome compactness. In
large genomes, repeated sequences (located mostly in non-coding regions) promote rearrangements that are, most of
the time, deleterious. There is thus an indirect selective pressure to limit the number of rearrangements, which is achieved by
eliminating repeats (fewer homologous rearrangements) and
by reducing genome size (fewer nonhomologous rearrangements). However, we have also shown that the absence of
rearrangements is correlated with fitness stasis, suggesting
that rearrangements can sometimes be directly beneficial or
provide an appropriate genetic background for subsequent beneficial mutations. A minimal number of rearrangements is
thus required for evolvability. Here, most of the rearrangements kept by evolution are homologous ones. For them to be
possible, repeats must be created at least as fast as they are
destroyed by small mutations. In the end, the best conditions
for evolvability seem to be a small basal rate of nonhomologous rearrangement combined with a low-enough mutation
rate, thus leading to a few stable repeats and to an intermediate degree of variability by homologous rearrangements.

Figure 8: Analysis of the line of ancestry of the final best
individual for µn = 10^−2 and µm = 5 × 10^−6. First row:
evolution of the fitness (the smaller the distance to the target, the higher the probability of reproduction). Second row:
evolution of the number of mutational events, by windows of
500 generations. Third row: number of alignments found by
Mummer on the genome (parameters: see Figure 7).
[Panels: Seed 1, Seed 2, Seed 3. x-axis: Generations. y-axes: Distance to target (log. scale); Event count (windows of 500 gener.); Number of Mummer alignments. Legend: point mutations, small insertions, small deletions, duplications, deletions, translocations, inversions.]

Figure 9: Analysis of the fixed rearrangements for µn =
10^−2 and µm = 5 × 10^−6 (all seeds together). Each point represents a rearrangement that occurred on the line of ancestry
of the final best individual.
[x-axis: Alignment score. y-axis: Impact on metabolic error (negative = beneficial). Legend: duplications, deletions, translocations, inversions. Plot annotation: Kendall's tau = −0.0148, p-value = 0.0779.]
Acknowledgements
We gratefully acknowledge support from the CNRS/IN2P3
Computing Center (Lyon/Villeurbanne - France), for providing a significant amount of the computing resources needed
for this work.
References
Adami, C. (2006). Digital genetics: unravelling the genetic basis
of evolution. Nat. Rev. Genet., 7(2):109–118.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman,
D. J. (1990). Basic local alignment search tool. Journal of
Molecular Biology, 215:403–410.
Bénichou, O., Coppey, M., Moreau, M., Suet, P. H., and Voituriez,
R. (2005). Optimal Search Strategies for Hidden Targets.
Physical Review Letters, 94(19):198101+.
Beslon, G., Parsons, D., Sanchez-Dehesa, Y., Pena, J., and Knibbe,
C. (2010). Scaling laws in bacterial genomes: A side-effect of
selection of mutational robustness. BioSystems, 102(1):32–40.
Blickle, T. and Thiele, L. (1996). A comparison of selection
schemes used in evolutionary algorithms. Evol. Comput.,
4(4):361–394.
Dittrich, P., Ziegler, J., and Banzhaf, W. (2001). Artificial
chemistries-a review. Artif. Life, 7(3):225–275.
Earl, D. J. and Deem, M. W. (2004). Evolvability is a selectable
trait. Proceedings of the National Academy of Sciences of the
United States of America, 101(32):11531–11536.
Kirschner, M. and Gerhart, J. (1998). Evolvability. Proceedings of
the National Academy of Sciences, 95(15):8420–8427.
Knibbe, C., Coulon, A., Mazet, O., Fayard, J.-M., and Beslon, G.
(2007). A long-term evolutionary pressure on the amount of
noncoding DNA. Mol. Biol. Evol., 24(10):2344–2353.
Knibbe, C., Fayard, J.-M., and Beslon, G. (2008). The topology
of the protein network influences the dynamics of gene order:
from systems biology to a systemic understanding of evolution. Artificial Life, 14(1):149–156.
Kurtz, S., Phillippy, A., Delcher, A., Smoot, M., Shumway, M.,
Antonescu, C., and Salzberg, S. (2004). Versatile and open
software for comparing large genomes. Genome Biology,
5:R12.
Langdon, W. and Banzhaf, W. (2008). Repeated patterns in genetic
programming. Natural Computing, 7(4):589–613.
Misevic, D., Ofria, C., and Lenski, R. E. (2006). Sexual reproduction reshapes the genetic architecture of digital organisms.
Proc. R. Soc. B., 273(1585):457–464.
Neidhardt, F. C. (1996). Escherichia coli and Salmonella: cellular
and molecular biology. ASM Press.
Parsons, D. P., Knibbe, C., and Beslon, G. (2010). Importance of
the rearrangement rates on the organization of transcription.
In Proceedings of Artificial Life XII, pages 479–486.
Wilke, C. O., Wang, J. L., Ofria, C., Lenski, R. E., and Adami, C.
(2001). Evolution of digital organisms at high mutation rates
leads to survival of the flattest. Nature, 412(6844):331–333.