DOI: http://dx.doi.org/10.7551/978-0-262-32621-6-ch006
What happened to my genes?
Insights on gene family dynamics from digital genetics experiments
C. Knibbe (1,2) and D. P. Parsons (1)
(1) INRIA Rhône-Alpes, Montbonnot F-38322, France
(2) Université de Lyon, Université Lyon 1, CNRS, UMR5205, LIRIS, Villeurbanne, F-69622, France
[email protected]
Abstract
Gene families are sets of homologous genes formed by duplications of a single original gene. Inferring their history in
terms of gene duplications, gene losses and gene mutations
yields fundamental insights into the molecular basis of evolution. However, phylogenetic inference of gene family evolution faces two difficulties: (i) the delimitation of gene families based on sequence similarity, and (ii) the fact that the
models of evolution used for reconstruction are tested against
simulated data that are produced by the model itself. Here,
we show that digital genetics, or in silico experimental evolution, can provide thought-provoking synthetic gene family
data, robust to rearrangements in gene sequences and, most
importantly, not biased by where and how we think natural
selection should act. Using aevol, a digital genetics model
with an abstract phenotype but a realistic genome structure,
we analyzed the evolution of 3,512 synthetic gene families
under directional selection. The turnover of gene families
in evolutionary runs was such that only 21% of those families would be accessible for classical phylogenetic inference.
Extinct families showed patterns different from the final, observable ones, both in terms of dynamics of gene gains and
losses and in terms of gene sequence evolution. This study
also reveals that gene sequence evolution, and thus evolutionary innovation, occurred not only through local mutations, but
also through chromosomal rearrangements that re-assembled
parts of existing genes.
Introduction
How do new genes arise? Do they evolve mainly by local
mutations or domain shuffling? Which events drive them to
extinction? These questions are fundamental to understanding
the evolutionary dynamics of living systems. Because the
preservation of soft tissues is rare in fossil records, paleontology provides precious but limited knowledge of the past
of the living world. The study of molecular evolution thus
largely relies on the analysis of extant genes, which may
or may not be a representative sample of genetic diversity
throughout the evolution of life. Central in this analysis is
the notion of gene family, defined as a set of homologous
genes formed by duplications of a single original gene. Insights into the evolutionary dynamics at the molecular level
are obtained by inferring the evolutionary history of gene
gains and losses and of gene mutations in a gene family.
The usual strategy to identify gene families consists in
detecting significant sequence similarities in gene or protein sequences. This method is inherently biased towards
the detection of families that evolve mainly through local
mutations rather than domain shuffling. As Song et al.
(2008) make clear, “multidomain sequences, especially
those with promiscuous domains that occur in many contexts, are frequently excluded from genomic analyses due to
the lack of a theoretical framework and practical methods for
detecting multidomain homologs”. Efforts are thus ongoing to develop multidomain homology identification methods (Geer et al., 2002; Enright et al., 2002; Lin et al., 2006;
Song et al., 2008; Jachiet et al., 2013).
Once gene families have been identified, their evolutionary histories are inferred, using implicit or explicit models
of evolution to describe the patterns of DNA base substitution and amino acid replacement (Liò and Goldman, 1998)
and the patterns of gene gains and losses (Arvestad et al.,
2004; Vilella et al., 2008; Akerborg et al., 2009; Rasmussen
and Kellis, 2012; Boussau et al., 2013). These models of
evolution make assumptions – for example, the model used
by Vilella et al. (2008) assumes that gene duplications and
deletions are rare events, and that duplication followed by
complementary gene losses on the left and right branches
of a duplication node is an unlikely scenario. Most models also assume that different gene families evolve independently, while a single duplication or deletion can actually
span several genes. The scarcity of well-preserved ancient
DNA samples makes it difficult to really test these hypotheses. The common practice to test a phylogenetic method is
thus to simulate artificial sequences to generate benchmarks.
However, these artificial sequences are usually generated
with the same general model of evolution as the one used
by the phylogenetic method being tested, with only minor
differences (see for example Rasmussen and Kellis (2012);
Boussau et al. (2013)). There is thus a form of circularity
in the overall process, which could leave some important aspects of evolutionary dynamics in the dark.
ALIFE 14: Proceedings of the Fourteenth International Conference on the Synthesis and Simulation of Living Systems
To provide better benchmarks for phylogenetic inference,
some simulators like EvolSimulator (Beiko and Charlebois,
2007) and ALF (Dalquen et al., 2012) have been developed independently of a particular phylogenetic inference
method. For example, ALF uses classical models of evolution at the gene sequence level, but allows for the duplication or loss of several consecutive genes at once. However, both ALF and EvolSimulator simulate only one sequence per species. The action of natural selection is incorporated in the mutational process, in the sense that only
mutations assumed to be neutral or beneficial are simulated.
For example, mutations that would lead to the formation of
a stop codon (nonsense mutations) are not allowed in ALF
(Dalquen et al., 2012). In EvolSimulator, specific probabilities of duplication and loss are pre-assigned to each gene
(Beiko and Charlebois, 2007). Simulators of sequence evolution are also being developed in population genetics (reviewed by Hoban et al. (2012)), to predict the molecular
polymorphism expected under various demographic scenarios. All individuals of the population are simulated but the
genomic architecture is usually fixed, implying that gene
gains or losses are not allowed. Deleterious events are allowed but the distribution of fitness effects is predefined at
each locus.
Experimental evolution of microbes is a much more direct way to study evolution. Although the time scale of
these laboratory experiments is short compared to those at
stake in phylogenetic reconstruction, gene gains and losses
do occur (Nilsson et al., 2005; Blount et al., 2012; Tenaillon
et al., 2012; Maharjan et al., 2013; Payen et al., 2014) and
frozen samples allow for a partial fossil record. In silico experimental evolution, or digital genetics (Adami, 2006), can
bring a complementary perspective by providing fast synthetic genomic data, in which – in contrast to other types of
simulators – the action of natural selection on genomic sequences is not predetermined by the user. In digital genetics,
an abstract artificial chemistry is used to compute a phenotype from a genotype, and selection is based on the phenotype, not on the genotype. The Avida platform has already
been used to test the effect of selection on phylogenetic reconstruction methods (Hagstrom et al., 2004; Hang et al.,
2007). Here, we show how a digital genetics platform with a
realistic genome structure can be used to directly study gene
family evolution, with both local mutations and rearrangements. The exhaustive knowledge of all evolutionary events
allows for the identification of gene families even when sequence similarity could be impaired by rearrangements that
shuffled parts of coding sequences.
After a presentation of the model, called aevol (http://www.aevol.fr), we study the evolution of ten independent populations under directional selection, yielding
3,512 synthetic gene families. Using a new postprocessing
tool designed for this purpose, we analyze the dynamics of
genes within the context of those families – how and how often new genes are created, how and how often they are lost,
how many and what types of mutational events occur in their
sequences. By going beyond the simple time series of gene
number, we show that our usual interpretation of gene number evolution in aevol was partly wrong. Not all new genes
arise by duplication-divergence. Many are also created from
previously non-coding sequences, after either a local mutation or a rearrangement. We also show that there is a high
turnover of gene families, many of them lasting only a few
hundreds or thousands of generations. This implies that final
extant genes give only a partial insight into the dynamics of
gene family evolution. Moreover, our analysis reveals that
rearrangements do not restrict themselves to changing gene
number and gene order. They also play a significant role in
gene sequence evolution, and thus in evolutionary innovation, by rearranging parts of existing genes.
Aevol: A digital genetics model
Aevol is a digital genetics model that simulates the evolution
of a population of N haploid organisms through a process of
variation and selection. It was designed to study the evolution of genome structure (Knibbe et al., 2007; Beslon et al.,
2010; Parsons et al., 2010; Frenoy et al., 2013; Batut et al.,
2013). Thus, the design of the model focuses on the realism
of the genome level and of the mutational process, while the
selection process simply relies on a one-dimensional curve-fitting task.
Genome representation
Each artificial organism owns a chromosome whose structure is inspired by prokaryotic genomes. It is organized as
a circular double-strand binary string containing a variable
number of genes separated by non-coding sequences (figure
1). Genes are delimited by predefined signaling sequences
indicating transcription and translation start and stop. Transcription initiates at promoters, defined in the model as sequences that differ from an (arbitrarily chosen) 22-bp consensus sequence by d ≤ 4 mismatches. When a promoter
is found, the transcription proceeds until a terminator is
reached. Terminators are defined as sequences that would
be able to form a stem-loop structure, as the ρ-independent
bacterial terminators do. In the following experiments, terminators had the structure $abcd{*}{*}{*}\bar{d}\bar{c}\bar{b}\bar{a}$, where $\bar{a} = 0$ if
$a = 1$, and conversely. The expression level $e$ of an mRNA
is determined according to the similarity of its promoter to
the consensus: $e = 1 - d/5$.
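The promoter and terminator rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the aevol implementation: the 22-bp `CONSENSUS` string is made up (the text only says the consensus was chosen arbitrarily), and the function names are hypothetical.

```python
# Hypothetical 22-bp consensus; the paper does not give the actual sequence.
CONSENSUS = "0101011001110010010110"
D_MAX = 4  # a promoter may differ from the consensus by at most 4 mismatches


def expression_level(candidate, consensus=CONSENSUS):
    """Return e = 1 - d/5 if `candidate` qualifies as a promoter, else None."""
    if len(candidate) != len(consensus):
        raise ValueError("candidate must be as long as the consensus")
    # d is the number of mismatching positions (Hamming distance).
    d = sum(a != b for a, b in zip(candidate, consensus))
    if d > D_MAX:
        return None
    return 1 - d / (D_MAX + 1)


def is_terminator(window):
    """Check the stem-loop pattern abcd***d'c'b'a' on an 11-bit window,
    where x' denotes the complement of bit x."""
    if len(window) != 11:
        return False
    stem, mirror = window[:4], window[7:]
    # The last four bits must be the reversed complement of the first four.
    return all(stem[i] != mirror[3 - i] for i in range(4))
```

With this reading, a perfect match to the consensus gives e = 1 and each additional mismatch lowers the expression level by 0.2, down to e = 0.2 at d = 4.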
Transcribed sequences (mRNAs) do not necessarily contain coding sequences. The translation initiation signal is
the motif 011011****000 (Shine-Dalgarno-like sequence
followed, a few base-pairs away, by a START codon). When
this signal is found on an mRNA, the downstream sequence is
read three bases (one codon) at a time until the termination
signal, the STOP codon 001, is found in the same reading
frame. Each codon lying between the initiation and termination signals is translated into an abstract “amino-acid” using
an artificial genetic code (Figure 1).
[Figure 1 diagram. Labels: chromosome, promoter, Shine-Dalgarno, START, STOP, terminator, transcription (the expression level e of the mRNA depends on the promoter sequence), translation, genetic code, contribution level, phenotypic trait, phenotype.
Genetic code: 000 START, 001 STOP, 100 M0, 101 M1, 010 W0, 011 W1, 110 H0, 111 H1.
Worked example: the “amino acid” sequence W0 - M1 - H1 - W1 - M1 - H0 gives m Gray = 11 (binary 10, m = 0.667), w Gray = 01 (binary 01, w = 0.333 wmax) and h Gray = 10 (binary 11, h = 1.0).]
Figure 1: In the model, each organism owns a circular double-strand binary chromosome (a) along which genes are delimited by
predefined signal sequences (b). Promoters and terminators mark the boundaries of RNAs (c) within which coding sequences
are in turn identified between a Shine-Dalgarno-START signal and an in-frame STOP codon. Each coding sequence is then
translated into a protein sequence using a predefined genetic code (d). This protein sequence is decoded as three real parameters
called m, w and h (e). Proteins, phenotypes and environments are represented similarly through mathematical functions that
associate a level to each abstract phenotypic trait in [0, 1]. The contribution of a protein is a piecewise-linear function with a
triangular shape, with position m, half-width w and height h (f). All proteins encoded in the chromosome are then combined
to compute the phenotype (g), which is compared to the environmental target to compute the fitness of the individual.
Protein function and phenotype computation
In the model, we assume that there is an abstract, continuous one-dimensional space Ω = [0, 1] of phenotypic traits.
Each protein contributes positively or negatively to a subset
of phenotypic traits, and is modeled as a mathematical function that associates a contribution level between -1.0 and 1.0
to each phenotypic trait. For simplicity, we use piecewise-linear functions with a symmetric, triangular shape (figure
1). In this way, only three numbers are needed to characterize the contribution of a protein: The position m (m ∈ Ω)
of the triangle on the axis, its half-width w and its height
h (positive or negative). The protein thus contributes to the
phenotypic traits in [m − w, m + w], with a maximal contribution for the traits closest to m. Thus, various types of
proteins can co-exist, from highly efficient and highly specialized ones (low w, high h) to polyvalent but poorly efficient ones (high w, low h).
In this framework, the sequence of each protein is decomposed into three interlaced binary subsequences that will in
turn be decoded as the values for the m, w and h parameters. For instance, the codon 010 (resp. 011) is translated
into the single amino acid W0 (resp. W1), which means
that it adds a bit 0 (resp. 1) to the Gray code of w. (The
Gray code is a variant of the traditional binary code. It is
widely used in evolutionary computation because it avoids
the so-called Hamming cliffs: in the Gray code representation, consecutive integers are assigned bit strings that differ
by only one bit.) Small mutations in the coding sequence
(point mutations, indels, possibly causing frame shifts) can
change these parameters and hence change the contribution
of the protein to the phenotypic traits.
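The decoding just described can be illustrated with a short Python sketch. It assumes the codon table read off Figure 1 (the text itself only confirms START = 000, STOP = 001, W0 = 010 and W1 = 011, so the M and H rows are an inference from the figure). It also assumes that each decoded integer is normalized by the largest value representable with its number of bits, which reproduces the worked example of Figure 1. The rescaling of w by wmax and the signed range of h are left out, and all function names are hypothetical.

```python
# Codon table as read from Figure 1; the M and H rows are an assumption.
GENETIC_CODE = {
    "000": "START", "001": "STOP",
    "100": "M0", "101": "M1",
    "010": "W0", "011": "W1",
    "110": "H0", "111": "H1",
}


def translate(coding_sequence):
    """Read codons (starting right after the START codon) until the first
    in-frame STOP, and return the list of abstract amino acids."""
    protein = []
    for i in range(0, len(coding_sequence) - 2, 3):
        amino_acid = GENETIC_CODE[coding_sequence[i:i + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein


def gray_to_int(bits):
    """Decode a Gray-coded bit string, most significant bit first."""
    value = 0
    previous = 0
    for bit in bits:
        previous ^= int(bit)  # each binary bit is the XOR of the Gray prefix
        value = (value << 1) | previous
    return value


def decode_protein(protein):
    """Split the amino acids into the interlaced m, w and h Gray codes and
    normalize each decoded integer to [0, 1]."""
    codes = {"M": "", "W": "", "H": ""}
    for amino_acid in protein:
        if amino_acid[0] in codes:  # internal START codons are ignored here
            codes[amino_acid[0]] += amino_acid[1]
    return {
        name.lower(): gray_to_int(bits) / (2 ** len(bits) - 1) if bits else 0.0
        for name, bits in codes.items()
    }
```

Applied to the protein of Figure 1 (W0 - M1 - H1 - W1 - M1 - H0), this gives m = 2/3, w = 1/3 and h = 1, matching the values shown in the figure.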
Once all the proteins encoded on the genotype of the organism have been identified, their contributions are combined to get the final level for each phenotypic trait. This is
done by summing the mathematical functions of all proteins
and keeping the result bounded between 0 and 1.0. The resulting piecewise-linear function fP : Ω → [0, 1.0] is called
the phenotype of the organism. It indicates the level of each
phenotypic trait in Ω.
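As a minimal sketch of this computation, the phenotype can be evaluated on a discretized trait axis by summing triangles and clipping the result. The discretization and the function names are assumptions made for illustration; the paragraph above describes exact piecewise-linear functions, and any weighting of a protein by the expression level of its mRNA is not represented here.

```python
import numpy as np


def triangle(x, m, w, h):
    """Contribution of one protein: height h at position m, decreasing
    linearly to 0 at m - w and m + w, and 0 elsewhere."""
    if w <= 0:
        return np.zeros_like(x)
    return h * np.clip(1 - np.abs(x - m) / w, 0.0, None)


def phenotype(x, proteins):
    """Sum the contributions of all (m, w, h) proteins on the sampled trait
    axis x and keep the result bounded between 0 and 1."""
    total = np.zeros_like(x)
    for m, w, h in proteins:
        total += triangle(x, m, w, h)
    return np.clip(total, 0.0, 1.0)
```

For instance, `phenotype(np.linspace(0.0, 1.0, 101), [(0.5, 0.1, 0.4)])` returns an array that peaks at 0.4 for the trait 0.5 and is zero outside [0.4, 0.6].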
Environment, adaptation and selection
In the model, fitness depends on the difference between the
levels of the phenotypic traits, and target levels defined by a
mathematical function fT : Ω → [0, 1.0]. This target function indicates the optimal level of each phenotypic trait in
Ω and is called the environmental target, or target for short.
Here, fT was made up of three gaussian lobes with standard
deviation 0.05 and maximal height 0.5, centered on x = 0.2,
0.6 and 0.8 respectively. It was kept constant over evolutionary time.
Adaptation was specifically measured by the gap
$g = \int_{\Omega} |f_T(x) - f_P(x)| \, dx$ between $f_P$ and $f_T$. The lower
the gap, the fitter the individual. This measure penalizes
both the under-realization and the over-realization of each
phenotypic trait.
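The gap can be approximated numerically, as in the sketch below. It assumes that the three Gaussian lobes of the target are simply summed (the text does not say how they are combined, and with these parameters they barely overlap) and it uses a trapezoidal rule on a sampled axis. The function names and the sampling are illustrative choices.

```python
import numpy as np


def target(x, centers=(0.2, 0.6, 0.8), sigma=0.05, height=0.5):
    """Target f_T: three Gaussian lobes with the parameters given in the
    text. Summing the lobes is an assumption of this sketch."""
    total = sum(height * np.exp(-0.5 * ((x - c) / sigma) ** 2) for c in centers)
    return np.clip(total, 0.0, 1.0)


def gap(f_p, f_t, x):
    """Approximate g = ∫ |f_T(x) - f_P(x)| dx with the trapezoidal rule,
    where f_p and f_t are sampled at the points x."""
    diff = np.abs(f_t - f_p)
    return float(np.sum((diff[1:] + diff[:-1]) * np.diff(x)) / 2)
```

Under these assumptions, an organism without any gene (a phenotype that is zero everywhere) has a gap close to 0.188, which is the area under the three lobes.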
In the current version of Aevol, the population size is constant (here N = 1,000 individuals) and the population is entirely renewed at each generation. A probability of reproduction is assigned to each individual according to its gap, and a multinomial drawing determines the actual number of offspring each individual will have. Here, we used the so-called “fitness-proportionate” selection scheme, where the probability of reproduction of an individual with gap $g$ was $e^{-kg} / \sum_{i=1}^{N} e^{-k g_i}$. Here the environment was considered perfectly mixed, but a spatial grid structure where an individual competes only with its neighbors can also be used.
Mutations and rearrangements
During their replication, genomes can undergo local mutations (point mutations and small insertions or deletions of 1 to 6 bp) and chromosomal rearrangements (duplications, deletions, translocations and inversions). The breakpoints for these rearrangements are randomly chosen on the chromosome. A translocation is here defined as moving a segment to another position on the chromosome. Other versions of the platform exist that allow for plasmids, in which case translocations can move subsequences from the chromosome to the plasmids and conversely. This feature was not used here. Similarly, although lateral transfer is possible in aevol, we did not use it in these experiments, to keep the setup simple in this first study of gene family evolution. The rates at which the different types of genetic modification occur are defined per base, per replication.
Workflow with the software suite
The typical workflow with the aevol software suite starts with the preparation of the initial population following the initialization method chosen by the user (with a random genome or with a mix of already evolved ones, for a competition assay for example). The second step is the evolutionary run itself. Depending on the parameters, a run of 100,000 generations may take from several hours up to several days. Run time depends on the population size, but also on the spontaneous rates of rearrangements, since they influence the evolved genome size (Knibbe et al., 2007). If ancestry relationships and mutations have been recorded during the run, and if individuals were asexual, it is possible, as a third step, to extract the line of descent of the best final individual from the recorded data and to replay the evolutionary events that occurred along this successful lineage. Except for the very last mutations that were possibly still segregating in the population, the replayed mutations are those that were fixed.
Results
We let 10 populations of 1,000 haploid asexual individuals evolve independently under directional selection for 100,000 generations. The spontaneous rate of each type of mutational event was set to $10^{-5}$ per bp. At the beginning of a run, all organisms of the population were initialized with the same random sequence of 5,000 bp containing at least one gene. This initial sequence was different for each population. In practice, populations started with either one or two genes. With this setup, we do not aim at mimicking the origin of life but rather the adaptation to a novel niche. Indeed, in the model, an individual without any gene on its chromosome can still replicate itself and express its genes: “Core genes” for replication, transcription, translation are assumed to be implicitly present in each individual and their evolution is not modeled. What is actually simulated is the evolution of the non-essential subset of the genome, when the population faces a new environment.
As shown by Figure 2, genome evolution on the successful lineages starts with a phase of expansion, where new genes are massively acquired, along with much non-coding DNA. This excess DNA is then progressively removed from the genome, while gene acquisition slows down. This pattern was already observed by Knibbe et al. (2007), in a similar setup. Here, we went deeper into the analysis of gene repertoire dynamics by tracking the fate of each gene, as well as the paralogy relationships between genes.
[Figure 2 plot: number of coding sequences (0 to 150) and non-essential DNA (bp, log scale, 10^3 to 10^6) against generations (0 to 100,000).]
Figure 2: Evolution of genome size on the line of descent of the final best individuals. The shaded area indicates the standard deviation across repetitions. Non-essential DNA is defined as DNA that can be removed without changing the phenotype. It includes intergenic DNA, but also the transcribed but untranslated regions (UTRs).
For each repetition, evolutionary events on the line of descent of the best final individual were replayed. Each gene in the initial genome was tagged and considered the root of a gene family, which was stored as a binary tree. When replaying the mutational events, the fate of each gene was followed and recorded mutation after mutation. We considered in this analysis that a gene was composed of its coding sequence and of its “upstream region”, defined as the sequence located between the first base pair of the first promoter of the coding sequence and the first base pair of the Shine-Dalgarno-start signal for translation initiation.
Figure 3 shows an example of a small gene family. The topology of the tree indicates the dynamics of gene duplications and losses. Because the exact timing of events is known, the branch lengths represent the real time elapsed, in number of generations. The branches are annotated with the mutational events that affected either the coding sequences or their upstream regions. These events can either be local mutations or chromosomal rearrangements. Indeed, aevol, along with the “ARN” model (Banzhaf, 2003) and potentially the model of early metabolism by Ullrich et al. (2011), is one of the few digital genetics models in which the breakpoints of chromosomal rearrangements are not constrained to intergenic regions. They operate at the sequence level rather than at the gene level and are thus blind to the genic/intergenic status of the sequences they disrupt. As a consequence, they can modify gene content and gene order, but also generate variability in gene sequences.
Number of gene families
In previous studies with aevol, which mostly relied on the time series of gene number as in Figure 2, we thought that most evolved genes ultimately descended from the initial one(s). Indeed, because simulations started with only one or two genes and because several initiation signals are necessary for a sequence to be coding, one would expect that most genes would be created by a duplication-divergence process and that each run would thus contain one or two gene families only. Contrary to this expectation, there were on average 351.2 ± 158.3 gene families per evolutionary run. This is not an artifact due to a saturation of the phylogenetic signal, because gene family identification is here based on the exact knowledge of all events and not on sequence similarity, and is thus insensitive to such a saturation. Thus, de novo gene creation was not rare. On average, in a run, the gene families of the initialization represented only 0.4% of all families and 10.8% of the final genes. This fits with a recent analysis of proto-gene candidates in the genome of the yeast S. cerevisiae, which suggested that “de novo gene birth may be more prevalent than sporadic gene duplication” (Carvunis et al., 2012). In our synthetic dataset, 55% of de novo gene creations were due to a local mutation and around 44% were due to a chromosomal rearrangement.
The large variation across runs in the number of families stems from a bimodal distribution, with seven runs centered around 281 families (hereafter called group A) while three other runs (called group B) are centered around 632 families. This variation is due to the initial phase of genome expansion, since on average 61% of the gene families in these three runs were extinct before generation 10,000. At the end of the runs (t = 100,000 generations), on average 73.4 ± 6.2 gene families were still active in each run, meaning that they had at least one remaining gene. Thus, the families that would be accessible for classical phylogenetic inference would represent only 21% of the gene families that played a role in the evolutionary history of the evolved populations.
Size of gene families
What is usually called the family size is the number of non-extinct leaves at the time of observation. Here, the mean family size at t = 100,000 was 1.36 ± 0.1 non-extinct leaves, meaning that each family had on average 1.36 sibling genes (paralogs) in the evolved genome. In real genomes, gene family size is known to follow a power-law distribution (Huynen and van Nimwegen, 1998), with a vast majority of very small gene families and a few very large families. As shown by Figure 4, the family sizes obtained here do not span enough orders of magnitude to conclude that they follow a power-law distribution. However, in all runs, the vast majority of families had size 1, while only one or two families had a size larger than 4.
[Figure 4 plot: number of families (log scale, 1 to 100) against family size at t = 100,000, binned as 1, 2..3, 4..7, 8..15, 16..31.]
Figure 4: Distribution of gene family size at t = 100,000.
The different symbols correspond to the different repetitions
and the black curve is the mean frequency over the ten repetitions. Following Huynen and van Nimwegen (1998), family sizes were binned exponentially and both axes are logarithmic. Missing symbols correspond to a frequency of 0.
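The exponential binning used for Figure 4 (bins 1, 2..3, 4..7, 8..15, 16..31) can be sketched as follows. The function names are hypothetical, and the sketch returns raw counts per bin; whether the plotted frequencies were further normalized is not stated in the caption.

```python
from collections import Counter


def size_bin(size):
    """Return the (low, high) bounds of the power-of-two bin containing
    `size`: 1 -> (1, 1), 2..3 -> (2, 3), 4..7 -> (4, 7), and so on."""
    if size < 1:
        raise ValueError("family size must be at least 1")
    exponent = size.bit_length() - 1  # floor(log2(size)) for positive ints
    return 2 ** exponent, 2 ** (exponent + 1) - 1


def bin_family_sizes(sizes):
    """Count how many families fall into each exponential bin."""
    return Counter(size_bin(s) for s in sizes)
```

For example, `bin_family_sizes([1, 1, 1, 2, 3, 5])` counts three families in bin (1, 1), two in bin (2, 3) and one in bin (4, 7).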
Rates of gene duplication and loss
When taking all families of a run into account, a gene gain by duplication occurred on average every 212 generations in group A, and every 5.5 generations in group B. A gene loss occurred every 158 generations on average in group A, and every 5.4 generations in group B (63% of gene losses happened by the complete deletion of the gene, 16% were due to another chromosomal rearrangement and 21% were due to a local mutation). However, these rates of gene gains and losses are somewhat misleading because 90% of the gene duplications and gene losses occurred before generation 5,500. Hence, during the initial phase of genome expansion, the gene content is extremely dynamic. Besides, duplications and deletions generally encompass several neighboring genes, thereby creating strong correlations across families and inside different subtrees of a family. For example, in the first run, there were 604 gene duplications but they were concentrated in only 85 different generations.
[Figure 3 diagram: gene family tree with a 200-generation scale bar, tracing genes #130, #163, #164, #340, #432, #433, #781 and #782 from a de novo gene creation by a rearrangement at t = 214, through duplications at t = 240, 247, 297, 379, 407 and 1,425, to the last loss at t = 1,904.]
Figure 3: Example of a small gene family. This is the fourth gene family of the third run. It was born at t = 214 and went to extinction at t = 1,904. Squares indicate duplications, crosses indicate deletions, stars indicate inversions or translocations, and circles indicate point mutations, small insertions or small deletions. Gray events were neutral, while red events were deleterious and green events were beneficial. The topology of the tree was drawn with NJPlot (Perriere and Gouy, 1996).
Evolutionary rates of gene sequences
On each branch of each gene family, we counted and classified the events that modified the gene sequence without
killing it, and divided these counts by the branch length in
number of generations. Those rates per gene per generation were averaged over all branches of all gene families of
all runs to produce Figure 5A. It shows that the coding sequences underwent more changes than the upstream region.
This is an expected result given that (in the model) changes
in the coding sequences can change the phenotypic traits to
which the gene contributes, whereas changes in the promoter
can just modulate the level of a gene contribution. Both local
mutations and rearrangements modified gene sequences, but
rearrangements were more numerous than local mutations in
branches shorter than 290 generations, which represent 90%
of the dataset. Thus, the mean rate of rearrangements over
all branches turns out to be higher than the mean rate of local mutations. Beneficial mutations were also more frequent
than neutral events, which is expected under a directional selection setting, but also depends on the fact that the artificial
genetic code is not redundant. Neutral mutations can happen
between the promoter and the start signal, but it is a rather
small mutational target. Neutral mutations can also happen
in intergenic regions, but those were not monitored here.
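The averaging described at the start of this section (one rate per branch, then a mean over branches) gives every branch the same weight whatever its length, which is why event types that dominate on the many short branches also dominate the overall mean. The sketch below contrasts this mean of per-branch rates with a pooled rate. The function names are hypothetical, and skipping zero-length branches is an assumption of this sketch, not something stated in the text.

```python
def mean_branch_rate(branches):
    """Mean of per-branch rates, where `branches` is a list of
    (event_count, branch_length_in_generations) pairs.
    Zero-length branches are skipped (an assumption of this sketch)."""
    rates = [count / length for count, length in branches if length > 0]
    return sum(rates) / len(rates) if rates else 0.0


def pooled_rate(branches):
    """Total number of events divided by total branch length, shown only
    for comparison with the mean of per-branch rates."""
    total_length = sum(length for _, length in branches)
    total_events = sum(count for count, _ in branches)
    return total_events / total_length if total_length > 0 else 0.0
```

With one event on a 10-generation branch and one event on a 1,000-generation branch, the mean of per-branch rates is about 0.05 events per generation, whereas the pooled rate is about 0.002.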
Figure 5B shows the overall normalized variation of each
rate across all branches of all gene families of all runs. The
indicator with the lowest normalized variation is the rate
of all events that affect the gene sequence, regardless of
whether they are neutral or beneficial, local mutation or rearrangement. It cannot, however, be chosen as a molecular
clock, because it includes non-neutral events, whose count
would be affected by the strength and type of selection. A
good candidate for a molecular clock should thus both minimize its variation across branches and trees, and count neutral events only, in order to be robust to the selection regime.
According to this criterion, the rate of all neutral events affecting the gene sequence, including rearrangements, would
make a slightly better molecular clock than the rate of neutral local mutations only (Figure 5B, orange bars).
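The selection criterion above can be sketched as follows: compute the normalized variation of each indicator across branches (100 × standard deviation / mean, as in the caption of Figure 5B), then keep the neutral indicator with the lowest value. The indicator names and rate values below are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not specify the estimator.

```python
from statistics import mean, stdev


def normalized_variation(rates: list[float]) -> float:
    """Relative standard variation in percent: 100 * sd / mean.

    Assumption: sample standard deviation (statistics.stdev).
    """
    return 100 * stdev(rates) / mean(rates)


def best_clock(per_branch_rates: dict[str, list[float]],
               neutral: set[str]) -> str:
    """Return the neutral indicator whose rate varies least across branches."""
    candidates = {
        name: normalized_variation(rates)
        for name, rates in per_branch_rates.items()
        if name in neutral  # non-neutral indicators depend on selection
    }
    return min(candidates, key=candidates.get)


# Hypothetical per-branch rates for three indicators.
rates = {
    "all_events": [0.010, 0.011, 0.009],
    "neutral_events": [0.002, 0.003, 0.001],
    "neutral_local_mutations": [0.001, 0.003, 0.0005],
}
neutral_indicators = {"neutral_events", "neutral_local_mutations"}
print(best_clock(rates, neutral_indicators))
```

In this toy data, "all_events" has the lowest dispersion overall but is excluded because it counts non-neutral events. The filter on `neutral` encodes the second half of the criterion.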
Analyses of gene families in real genomes of fungi, insects, and mammals have revealed a negative correlation between the age of the family and the evolutionary rate of its
members (Capra et al., 2013). As shown by Figure 5C, such
a negative correlation is clear in our synthetic data if all gene
families are considered (r = −0.78, p-value < 2 × 10^-16).
However, if one restricts the analysis to the observable families in the final evolved genomes, the correlation becomes
positive but weaker (r = +0.23, p-value ∼ 3 × 10^-10).
In the final genomes accessible to phylogenetic analyses, no
trace remains from the fast evolving gene families that supported the initial (yet important) steps of the adaptation to
the novel environment. Note, however, that even these final
observable families evolve relatively fast and that we are actually simulating only the subset of non-essential genes that
confer a selective advantage in the novel environment. In
contrast, in real datasets reviewed in Capra et al. (2013), the
core genes for e.g. replication and gene expression would be
included. This could explain the differences in the correlation patterns between the synthetic and the real data.
ALIFE 14: Proceedings of the Fourteenth International Conference on the Synthesis and Simulation of Living Systems
[Figure 5, image not reproduced. Panel A: bar chart of "Mean rate per generation" for each event type. Panel B: bar chart of "Normalized dispersion" for the same event types. The event types are all events, neutral events, beneficial events, events in the coding sequence, events in the upstream region, local mutations, rearrangements, and the neutral or beneficial local mutations and rearrangements, overall and within the coding sequence or the upstream region. Panel C: "Rate of sequence evolution, incl. rearr. (mean on all branches of the family)" plotted against "Time span of the gene family", both on logarithmic axes, annotated with r = -0.78, p-value < 2 × 10^-16 and r = +0.23, p-value = 3 × 10^-10.]
Figure 5: A. Mean evolutionary rate, per gene per generation, for each type of event (average over all branches of
all gene families of all runs). B. Relative standard variation
(100 × standard deviation / mean) of each indicator across
all branches of all gene families of all runs. Orange bars
correspond to neutral indicators, candidates for a molecular clock. C. Correlation between the logarithm of family
time span and the logarithm of the mean evolutionary rate
in the family (black = all gene families, green = active families at t = 100,000). The mean evolutionary rate of a family
was computed as the average, over all its branches, of the
per-generation rate of events that modify the gene without
killing it. This is the indicator called “All events” in panels
A and B. It includes both local mutations and chromosomal
rearrangements.
Conclusion
This study of synthetic gene families revealed that, upon
adaptation to a new environment, (i) there was a high
turnover of gene families and extinct families showed patterns different from the final, observable ones, both in terms
of dynamics of gene gains and losses and in terms of gene
sequence evolution, (ii) gene sequence evolution occurred
through both local mutations and chromosomal rearrangements, and (iii) incorporating chromosomal rearrangements
in the evolutionary rate of gene sequences would slightly improve the accuracy of the molecular clock. Although some
of the results may depend on the simplifications of the model,
such as the absence of redundancy in the artificial genetic code,
the study demonstrates how digital genetics can
explore data inaccessible to classical phylogenetic methods.
With refined models developed in close collaboration with
phylogeneticists, digital genetics could prompt a reassessment of the biases and limitations in the studies of evolutionary dynamics of genes.
Acknowledgements
This research program was supported by the EvoEvo FP7
European project, by the PEPII program of the French
CNRS and by the Rhône-Alpes Institute for Complex Systems (IXXI). We thank Eric Tannier and Guillaume Beslon
for the inspiring discussions and comments.
References
Adami, C. (2006). Digital genetics: unravelling the genetic basis of evolution. Nat. Rev. Genet., 7:109–118.
Akerborg, O., Sennblad, B., Arvestad, L., and Lagergren, J. (2009). Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proc Natl Acad Sci USA, 106(14):5714–5719.
Arvestad, L., Berglund, A.-C., Lagergren, J., and Sennblad, B. (2004). Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. In Proc. RECOMB 2004, pages 326–335. ACM Press, New York.
Banzhaf, W. (2003). On the dynamics of an artificial regulatory network. In Banzhaf, W., Ziegler, J., Christaller, T., Dittrich, P., and Kim, J., editors, Advances in Artificial Life, volume 2801 of Lecture Notes in Computer Science, pages 217–227. Springer Berlin Heidelberg.
Batut, B., Parsons, D., Fischer, S., Beslon, G., and Knibbe, C. (2013). In silico experimental evolution: a tool to test evolutionary scenarios. BMC Bioinformatics, 14(Suppl 15):S11.
Beiko, R. G. and Charlebois, R. L. (2007). A simulation test bed for hypotheses of genome evolution. Bioinformatics, 23(7):825–831.
Beslon, G., Parsons, D. P., Sanchez-Dehesa, Y., Pena, J. M., and Knibbe, C. (2010). Scaling laws in bacterial genomes: A side-effect of selection of mutational robustness. BioSystems, 102(1):32–40.
Blount, Z. D., Barrick, J. E., Davidson, C. J., and Lenski, R. E. (2012). Genomic analysis of a key innovation in an experimental Escherichia coli population. Nature, 489(7417):513–518.
Boussau, B., Szollosi, G. J., Duret, L., Gouy, M., Tannier, E., and Daubin, V. (2013). Genome-scale coestimation of species and gene trees. Genome Research, 23(2):323–330.
Capra, J. A., Stolzer, M., Durand, D., and Pollard, K. S. (2013). How old is my gene? Trends in Genetics, 29(11):659–668.
Carvunis, A.-R., Rolland, T., Wapinski, I., Calderwood, M. A., Yildirim, M. A., Simonis, N., Charloteaux, B., Hidalgo, C. A., Barbette, J., Santhanam, B., Brar, G. A., Weissman, J. S., Regev, A., Thierry-Mieg, N., Cusick, M. E., and Vidal, M. (2012). Proto-genes and de novo gene birth. Nature, 487(7407):370–374.
Dalquen, D. A., Anisimova, M., Gonnet, G. H., and Dessimoz, C. (2012). ALF–A Simulation Framework for Genome Evolution. Molecular Biology and Evolution, 29(4):1115–1123.
Enright, A. J., Van Dongen, S., and Ouzounis, C. A. (2002). An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research, 30(7):1575–1584.
Frenoy, A., Taddei, F., and Misevic, D. (2013). Genetic architecture promotes the evolution and maintenance of cooperation. PLoS Computational Biology, 9(11):e1003339.
Geer, L. Y., Domrachev, M., Lipman, D. J., and Bryant, S. H. (2002). CDART: protein homology by domain architecture. Genome Research, 12(10):1619–1623.
Hagstrom, G. I., Hang, D. H., Ofria, C., and Torng, E. (2004). Using Avida to test the effects of natural selection on phylogenetic reconstruction methods. Artificial Life, 10(2):157–166.
Hang, D., Torng, E., Ofria, C., and Schmidt, T. M. (2007). The effect of natural selection on the performance of maximum parsimony. BMC Evolutionary Biology, 7(1):94.
Hoban, S., Bertorelle, G., and Gaggiotti, O. E. (2012). Computer simulations: tools for population and evolutionary genetics. Nature Reviews Genetics, 13(2):110–122.
Huynen, M. A. and van Nimwegen, E. (1998). The frequency distribution of gene family sizes in complete genomes. Molecular Biology and Evolution, 15(5):583–589.
Jachiet, P. A., Pogorelcnik, R., Berry, A., Lopez, P., and Bapteste, E. (2013). MosaicFinder: identification of fused gene families in sequence similarity networks. Bioinformatics, 29(7):837–844.
Knibbe, C., Mazet, O., Chaudier, F., Fayard, J.-M., and Beslon, G. (2007). Evolutionary coupling between the deleteriousness of gene mutations and the amount of non-coding sequences. J. Theor. Biol., 244(4):621–630.
Lin, K., Zhu, L., and Zhang, D. Y. (2006). An initial strategy for comparing proteins at the domain architecture level. Bioinformatics, 22(17):2081–2086.
Liò, P. and Goldman, N. (1998). Models of molecular evolution and phylogeny. Genome Research, 8(12):1233–1244.
Maharjan, R. P., Gaffé, J., Plucain, J., Schliep, M., Wang, L., Feng, L., Tenaillon, O., Ferenci, T., and Schneider, D. (2013). A case of adaptation through a mutation in a tandem duplication during experimental evolution in Escherichia coli. BMC Genomics, 14(1):1–1.
Nilsson, A. I., Koskiniemi, S., Eriksson, S., Kugelberg, E., Hinton, J. C. D., and Andersson, D. I. (2005). Bacterial genome size reduction by experimental evolution. Proc Natl Acad Sci USA, 102(34):12112–12116.
Parsons, D. P., Knibbe, C., and Beslon, G. (2010). Importance of the rearrangement rates on the organization of transcription. In Proceedings of Artificial Life XII, pages 479–486.
Payen, C., Di Rienzi, S. C., Ong, G. T., Pogachar, J. L., Sanchez, J. C., Sunshine, A. B., Raghuraman, M. K., Brewer, B. J., and Dunham, M. J. (2014). The dynamics of diverse segmental amplifications in populations of Saccharomyces cerevisiae adapting to strong selection. G3: Genes|Genomes|Genetics, 4(3):399–409.
Perriere, G. and Gouy, M. (1996). WWW-Query: An on-line retrieval system for biological sequence banks. Biochimie, 78(5):364–369.
Rasmussen, M. D. and Kellis, M. (2012). Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Research, 22(4):755–765.
Song, N., Joseph, J. M., Davis, G. B., and Durand, D. (2008). Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins. PLoS Computational Biology, 4(5):e1000063.
Tenaillon, O., Rodriguez-Verdugo, A., Gaut, R. L., McDonald, P., Bennett, A. F., Long, A. D., and Gaut, B. S. (2012). The Molecular Diversity of Adaptive Convergence. Science, 335(6067):457–461.
Ullrich, A., Rohrschneider, M., Scheuermann, G., Stadler, P. F., and Flamm, C. (2011). In silico evolution of early metabolism. Artificial Life, 17(2):87–108.
Vilella, A. J., Severin, J., Ureta-Vidal, A., Heng, L., Durbin, R., and Birney, E. (2008). EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Research, 19(2):327–335.