What happened to my genes? Insights on gene family dynamics from digital genetics experiments

C. Knibbe (1,2) and D. P. Parsons (1)

(1) INRIA Rhône-Alpes, Montbonnot F-38322, France
(2) Université de Lyon, Université Lyon 1, CNRS, UMR5205, LIRIS, Villeurbanne, F-69622, France
[email protected]

ALIFE 14: Proceedings of the Fourteenth International Conference on the Synthesis and Simulation of Living Systems
DOI: http://dx.doi.org/10.7551/978-0-262-32621-6-ch006

Abstract

Gene families are sets of homologous genes formed by duplications of a single original gene. Inferring their history in terms of gene duplications, gene losses and gene mutations yields fundamental insights into the molecular basis of evolution. However, phylogenetic inference of gene family evolution faces two difficulties: (i) the delimitation of gene families based on sequence similarity, and (ii) the fact that the models of evolution used for reconstruction are tested against simulated data that are produced by the model itself. Here, we show that digital genetics, or in silico experimental evolution, can provide thought-provoking synthetic gene family data, robust to rearrangements in gene sequences and, most importantly, not biased by where and how we think natural selection should act. Using aevol, a digital genetics model with an abstract phenotype but a realistic genome structure, we analyzed the evolution of 3,512 synthetic gene families under directional selection. The turnover of gene families in evolutionary runs was such that only 21% of those families would be accessible for classical phylogenetic inference. Extinct families showed patterns different from the final, observable ones, both in terms of dynamics of gene gains and losses and in terms of gene sequence evolution. This study also reveals that gene sequence evolution, and thus evolutionary innovation, occurred not only through local mutations, but also through chromosomal rearrangements that re-assembled parts of existing genes.

Introduction

How do new genes arise? Do they evolve mainly by local mutations or domain shuffling?
Which events drive them to extinction? These questions are fundamental to understand the evolutionary dynamics of living systems. Because the preservation of soft tissues is rare in fossil records, paleontology provides precious but limited knowledge of the past of the living world. The study of molecular evolution thus largely relies on the analysis of extant genes, which may or may not be a representative sample of genetic diversity throughout the evolution of life. Central in this analysis is the notion of gene family, defined as a set of homologous genes formed by duplications of a single original gene. Insights into the evolutionary dynamics at the molecular level are obtained by inferring the evolutionary history of gene gains and losses and of gene mutations in a gene family. The usual strategy to identify gene families consists in detecting significant sequence similarities in gene or protein sequences. This method is inherently biased towards the detection of families that evolve mainly through local mutations rather than domain shuffling. As Song et al. (2008) make it clear, “multidomain sequences, especially those with promiscuous domains that occur in many contexts, are frequently excluded from genomic analyses due to the lack of a theoretical framework and practical methods for detecting multidomain homologs”. Efforts are thus ongoing to develop multidomain homology identification methods (Geer et al., 2002; Enright et al., 2002; Lin et al., 2006; Song et al., 2008; Jachiet et al., 2013). Once gene families have been identified, their evolutionary histories are inferred, using implicit or explicit models of evolution to describe the patterns of DNA base substitution and amino acid replacement (Liò and Goldman, 1998) and the patterns of gene gains and losses (Arvestad et al., 2004; Vilella et al., 2008; Akerborg et al., 2009; Rasmussen and Kellis, 2012; Boussau et al., 2013). 
These models of evolution make assumptions: for example, the model used by Vilella et al. (2008) assumes that gene duplications and deletions are rare events, and that duplication followed by complementary gene losses on the left and right branches of a duplication node is an unlikely scenario. Most models also assume that different gene families evolve independently, while a single duplication or deletion can actually span several genes. The scarcity of well-preserved ancient DNA samples makes it difficult to really test these hypotheses. The common practice to test a phylogenetic method is thus to simulate artificial sequences to generate benchmarks. However, these artificial sequences are usually generated with the same general model of evolution as the one used by the phylogenetic method being tested, with only minor differences (see for example Rasmussen and Kellis (2012); Boussau et al. (2013)). There is thus a form of circularity in the overall process, which could leave some important aspects of evolutionary dynamics in the dark. To provide better benchmarks for phylogenetic inference, some simulators like EvolSimulator (Beiko and Charlebois, 2007) and ALF (Dalquen et al., 2012) have been developed independently of a particular phylogenetic inference method. For example, ALF uses classical models of evolution at the gene sequence level, but allows for the duplication or loss of several consecutive genes at once. However, both ALF and EvolSimulator simulate only one sequence per species. The action of natural selection is incorporated in the mutational process, in the sense that only mutations assumed to be neutral or beneficial are simulated. For example, mutations that would lead to the formation of a stop codon (nonsense mutations) are not allowed in ALF (Dalquen et al., 2012).
In EvolSimulator, specific probabilities of duplication and loss are pre-assigned to each gene (Beiko and Charlebois, 2007). Simulators of sequence evolution are also being developed in population genetics (reviewed by Hoban et al. (2012)), to predict the molecular polymorphism expected under various demographic scenarios. All individuals of the population are simulated but the genomic architecture is usually fixed, implying that gene gains or losses are not allowed. Deleterious events are allowed but the distribution of fitness effects is predefined at each locus. Experimental evolution of microbes is a much more direct way to study evolution. Although the time scale of these laboratory experiments is short compared to those at stakes in phylogenetic reconstruction, gene gains and losses do occur (Nilsson et al., 2005; Blount et al., 2012; Tenaillon et al., 2012; Maharjan et al., 2013; Payen et al., 2014) and frozen samples allow for a partial fossil record. In silico experimental evolution, or digital genetics (Adami, 2006), can bring a complementary perspective by providing fast synthetic genomic data, in which – in contrast to other types of simulators – the action of natural selection on genomic sequences is not predetermined by the user. In digital genetics, an abstract artificial chemistry is used to compute a phenotype from a genotype, and selection is based on the phenotype, not on the genotype. The Avida platform has already been used to test the effect of selection on phylogenetic reconstruction methods (Hagstrom et al., 2004; Hang et al., 2007). Here, we show how a digital genetics platform with a realistic genome structure can be used to directly study gene family evolution, with both local mutations and rearrangements. The exhaustive knowledge of all evolutionary events allows for the identification of gene families even when sequence similarity could be impaired by rearrangements that shuffled parts of coding sequences. 
After a presentation of the model, called aevol (http://www.aevol.fr), we study the evolution of ten independent populations under directional selection, yielding 3,512 synthetic gene families. Using a new postprocessing tool designed for this purpose, we analyze the dynamics of genes within the context of those families: how and how often new genes are created, how and how often they are lost, how many and what types of mutational events occur in their sequences. By going beyond the simple time series of gene number, we show that our usual interpretation of gene number evolution in aevol was partly wrong. Not all new genes arise by duplication-divergence. Many are also created from previously non-coding sequences, after either a local mutation or a rearrangement. We also show that there is a high turnover of gene families, many of them lasting only a few hundred or a few thousand generations. This implies that final extant genes give only a partial insight into the dynamics of gene family evolution. Moreover, our analysis reveals that rearrangements do not restrict themselves to changing gene number and gene order. They also play a significant role in gene sequence evolution, and thus in evolutionary innovation, by rearranging parts of existing genes.

Aevol: A digital genetics model

Aevol is a digital genetics model that simulates the evolution of a population of N haploid organisms through a process of variation and selection. It was designed to study the evolution of genome structure (Knibbe et al., 2007; Beslon et al., 2010; Parsons et al., 2010; Frenoy et al., 2013; Batut et al., 2013). Thus, the design of the model focuses on the realism of the genome level and of the mutational process, while the selection process simply relies on a one-dimensional curve-fitting task.

Genome representation

Each artificial organism owns a chromosome whose structure is inspired by prokaryotic genomes.
It is organized as a circular double-strand binary string containing a variable number of genes separated by non-coding sequences (Figure 1). Genes are delimited by predefined signaling sequences indicating transcription and translation start and stop. Transcription initiates at promoters, defined in the model as sequences that differ from an (arbitrarily chosen) 22-bp consensus sequence by $d \le 4$ mismatches. When a promoter is found, transcription proceeds until a terminator is reached. Terminators are defined as sequences that would be able to form a stem-loop structure, as the ρ-independent bacterial terminators do. In the following experiments, terminators had the structure $abcd{*}{*}{*}\bar{d}\bar{c}\bar{b}\bar{a}$, where $\bar{a} = 0$ if $a = 1$, and conversely. The expression level $e$ of an mRNA is determined according to the similarity of its promoter to the consensus: $e = 1 - d/5$. Transcribed sequences (mRNAs) do not necessarily contain coding sequences. The translation initiation signal is the motif 011011****000 (Shine-Dalgarno-like sequence followed, a few base pairs away, by a START codon). When this signal is found on an mRNA, the downstream sequence is read three bases (one codon) at a time until the termination signal, the STOP codon 001, is found in the same reading frame. Each codon lying between the initiation and termination signals is translated into an abstract “amino-acid” using an artificial genetic code (Figure 1).
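The signal definitions above can be illustrated with a minimal Python sketch. It is not the aevol implementation: the 22-bp consensus below is a made-up placeholder (the text only says it is arbitrarily chosen), the M and H rows of the codon table are read off the Figure 1 legend, and skipping an internal 000 codon is an assumption, since the text does not say how such a codon is handled. The function names are hypothetical.

```python
import re

# Hypothetical 22-bp consensus: the paper only says it is arbitrarily chosen.
CONSENSUS = "0101011001110010010110"
D_MAX = 4  # a promoter differs from the consensus by at most 4 mismatches

# Artificial genetic code. 010 -> W0, 011 -> W1, START and STOP are stated in
# the text; the M and H rows are read off the Figure 1 legend (assumption).
CODON_TABLE = {"100": "M0", "101": "M1", "010": "W0",
               "011": "W1", "110": "H0", "111": "H1"}
START, STOP = "000", "001"
# Shine-Dalgarno-like motif, four free bases, then the START codon.
INIT_SIGNAL = re.compile(r"011011[01]{4}" + START)
COMPLEMENT = {"0": "1", "1": "0"}


def promoter_expression(candidate, consensus=CONSENSUS):
    """Return e = 1 - d/5 if candidate is a promoter (d <= 4), else None."""
    if len(candidate) != len(consensus):
        raise ValueError("candidate must have the consensus length")
    d = sum(a != b for a, b in zip(candidate, consensus))
    return 1.0 - d / (D_MAX + 1) if d <= D_MAX else None


def is_terminator(seq):
    """True if the 11-bp seq has the stem-loop shape abcd***, then the
    complements of d, c, b and a."""
    if len(seq) != 11:
        return False
    return all(seq[10 - i] == COMPLEMENT[seq[i]] for i in range(4))


def translate(mrna):
    """Return the amino-acid list of the first coding sequence, or None if
    there is no initiation signal or no in-frame STOP codon."""
    match = INIT_SIGNAL.search(mrna)
    if match is None:
        return None
    protein = []
    for i in range(match.end(), len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon == STOP:
            return protein
        if codon in CODON_TABLE:  # an internal 000 codon is skipped (assumption)
            protein.append(CODON_TABLE[codon])
    return None
```

For instance, `translate("011011" + "1111" + "000" + "010101111" + "001")` reads the codons 010, 101 and 111 before reaching the STOP codon.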
[Figure 1 (image not reproduced). Panels: (a) chromosome, (b) promoter, Shine-Dalgarno, START, STOP and terminator signals, (c) transcription, (d) translation, (e) protein parameters, (f) contribution level of a protein over the phenotypic traits, (g) phenotype. Genetic code shown in the figure: 000 START, 001 STOP, 100 M0, 101 M1, 010 W0, 011 W1, 110 H0, 111 H1. Worked example: the protein W0 - M1 - H1 - W1 - M1 - H0 gives m Gray = 11, m binary = 10, m = 0.667; w Gray = 01, w binary = 01, w = 0.333 wmax; h Gray = 10, h binary = 11, h = 1.0.]

Figure 1: In the model, each organism owns a circular double-strand binary chromosome (a) along which genes are delimited by predefined signal sequences (b). Promoters and terminators mark the boundaries of RNAs (c) within which coding sequences are in turn identified between a Shine-Dalgarno-START signal and an in-frame STOP codon. Each coding sequence is then translated into a protein sequence using a predefined genetic code (d). This protein sequence is decoded as three real parameters called m, w and h (e). Proteins, phenotypes and environments are represented similarly through mathematical functions that associate a level to each abstract phenotypic trait in [0, 1]. The contribution of a protein is a piecewise-linear function with a triangular shape, with position m, half-width w and height h (f). All proteins encoded in the chromosome are then combined to compute the phenotype (g), which is compared to the environmental target to compute the fitness of the individual.

Protein function and phenotype computation

In the model, we assume that there is an abstract, continuous one-dimensional space Ω = [0, 1] of phenotypic traits.
Each protein contributes positively or negatively to a subset of phenotypic traits, and is modeled as a mathematical function that associates a contribution level between -1.0 and 1.0 to each phenotypic trait. For simplicity, we use piecewise-linear functions with a symmetric, triangular shape (Figure 1). In this way, only three numbers are needed to characterize the contribution of a protein: the position m (m ∈ Ω) of the triangle on the axis, its half-width w and its height h (positive or negative). The protein thus contributes to the phenotypic traits in [m − w, m + w], with a maximal contribution for the traits closest to m. Thus, various types of proteins can co-exist, from highly efficient and highly specialized ones (low w, high h) to polyvalent but poorly efficient ones (high w, low h). In this framework, the sequence of each protein is decomposed into three interlaced binary subsequences that will in turn be decoded as the values for the m, w and h parameters. For instance, the codon 010 (resp. 011) is translated into the single amino acid W0 (resp. W1), which means that it adds a bit 0 (resp. 1) to the Gray code of w. (The Gray code is a variant of the traditional binary code. It is widely used in evolutionary computation because it avoids the so-called Hamming cliffs: in the Gray code representation, consecutive integers are assigned bit strings that differ by only one bit.) Small mutations in the coding sequence (point mutations, indels, possibly causing frame shifts) can change these parameters and hence change the contribution of the protein to the phenotypic traits. Once all the proteins encoded on the genotype of the organism have been identified, their contributions are combined to get the final level for each phenotypic trait. This is done by summing the mathematical functions of all proteins and keeping the result bounded between 0 and 1.0. The resulting piecewise-linear function fP : Ω → [0, 1.0] is called the phenotype of the organism.
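The decoding and summation steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the aevol code: the function names are hypothetical, each Gray-coded subsequence is normalized by its maximal value (which reproduces the worked example of Figure 1, Gray 11 → binary 10 → 0.667), and the scaling of w by a maximal width, the mapping of h to a signed height and the weighting by the expression level e are left out because the text does not give those formulas.

```python
def gray_to_int(bits):
    """Convert a Gray-coded bit string such as '11' to the integer it encodes."""
    g = int(bits, 2)
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


def decode_protein(amino_acids):
    """Split residues such as 'W0' or 'M1' into the three interlaced Gray
    subsequences and return normalized (m, w, h) values in [0, 1]."""
    bits = {"M": "", "W": "", "H": ""}
    for aa in amino_acids:
        bits[aa[0]] += aa[1]

    def normalize(b):
        # Normalization by the largest value is an assumption taken from Figure 1.
        return gray_to_int(b) / (2 ** len(b) - 1) if b else 0.0

    return tuple(normalize(bits[key]) for key in "MWH")


def triangle(x, m, w, h):
    """Contribution of one protein to trait x: a triangle of height h at
    position m, falling to zero at m - w and m + w."""
    if w <= 0:
        return 0.0
    return h * max(0.0, 1.0 - abs(x - m) / w)


def phenotype(x, proteins):
    """Level of trait x: sum of all protein contributions, bounded to [0, 1]."""
    total = sum(triangle(x, m, w, h) for m, w, h in proteins)
    return min(1.0, max(0.0, total))
```

Calling `decode_protein(["W0", "M1", "H1", "W1", "M1", "H0"])` returns approximately (0.667, 0.333, 1.0), matching the example of Figure 1, and `phenotype` can be evaluated on a grid of x values to approximate fP.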
The phenotype indicates the level of each phenotypic trait in Ω.

Environment, adaptation and selection

In the model, fitness depends on the difference between the levels of the phenotypic traits and target levels defined by a mathematical function $f_T : \Omega \to [0, 1.0]$. This target function indicates the optimal level of each phenotypic trait in Ω and is called the environmental target, or target for short. Here, $f_T$ was made up of three Gaussian lobes with standard deviation 0.05 and maximal height 0.5, centered on x = 0.2, 0.6 and 0.8 respectively. It was kept constant over evolutionary time. Adaptation was measured by the gap $g = \int_{\Omega} |f_T(x) - f_P(x)|\,dx$ between $f_P$ and $f_T$. The lower the gap, the fitter the individual. This measure penalizes both the under-realization and the over-realization of each phenotypic trait.

In the current version of aevol, the population size is constant (here N = 1,000 individuals) and the population is entirely renewed at each generation. A probability of reproduction is assigned to each individual according to its gap, and a multinomial drawing determines the actual number of offspring each individual will have. Here, we used the so-called “fitness-proportionate” selection scheme, where the probability of reproduction of an individual with gap $g$ was $e^{-kg} / \sum_{i=1}^{N} e^{-kg_i}$. Here the environment was considered perfectly mixed, but a spatial grid structure where an individual competes only with its neighbors can also be used.

Mutations and rearrangements

During their replication, genomes can undergo local mutations (point mutations and small insertions or deletions of 1 to 6 bp) and chromosomal rearrangements (duplications, deletions, translocations and inversions). The breakpoints for these rearrangements are randomly chosen on the chromosome. A translocation is here defined as moving a segment to another position on the chromosome. Other versions of the platform exist that allow for plasmids, in which case translocations can move subsequences from the chromosome to the plasmids and conversely. This feature was not used here. Similarly, although lateral transfer is possible in aevol, we did not use it in these experiments, to keep the setup simple in this first study of gene family evolution. The rates of the different types of genetic modification are defined per base, per replication.

Workflow with the software suite

The typical workflow with the aevol software suite starts with the preparation of the initial population following the initialization method chosen by the user (with a random genome or with a mix of already evolved ones, for a competition assay for example). The second step is the evolutionary run itself. Depending on the parameters, a run of 100,000 generations may take from several hours up to several days. The population size matters, and so do the spontaneous rates of rearrangements, since they influence the evolved genome size (Knibbe et al., 2007). If ancestry relationships and mutations have been recorded during the run, and if individuals were asexual, it is possible, as a third step, to extract the line of descent of the best final individual from the recorded data and to replay the evolutionary events that occurred along this successful lineage. Except for the very last mutations that were possibly still segregating in the population, the replayed mutations are those that were fixed.

Results

We let 10 populations of 1,000 haploid asexual individuals evolve independently under directional selection for 100,000 generations. The spontaneous rate of each type of mutational event was set to $10^{-5}$ per bp. At the beginning of a run, all organisms of the population were initialized with the same random sequence of 5,000 bp containing at least one gene. This initial sequence was different for each population. In practice, populations started with either one or two genes. With this setup, we do not aim at mimicking the origin of life but rather the adaptation to a novel niche. Indeed, in the model, an individual without any gene on its chromosome can still replicate itself and express its genes: “Core genes” for replication, transcription, translation are assumed to be implicitly present in each individual and their evolution is not modeled. What is actually simulated is the evolution of the non-essential subset of the genome, when the population faces a new environment.

As shown by Figure 2, genome evolution on the successful lineages starts with a phase of expansion, where new genes are massively acquired, along with much non-coding DNA. This excess DNA is then progressively removed from the genome, while gene acquisition slows down. This pattern was already observed in (Knibbe et al., 2007), in a similar setup. Here, we went deeper into the analysis of gene repertoire dynamics by tracking the fate of each gene, as well as the paralogy relationships between genes.

[Figure 2 (plot not reproduced). Panels: number of coding sequences; non-essential DNA (bp, log scale). Horizontal axis: generations, from 0 to 100,000.]

Figure 2: Evolution of genome size on the line of descent of the final best individuals. The shaded area indicates the standard deviation across repetitions. Non-essential DNA is defined as DNA that can be removed without changing the phenotype. It includes intergenic DNA, but also the transcribed but untranslated regions (UTRs).

For each repetition, evolutionary events on the line of descent of the best final individual were replayed. Each gene in the initial genome was tagged and considered the root of a gene family, which was stored as a binary tree. When replaying the mutational events, the fate of each gene was followed and recorded mutation after mutation. We considered in this analysis that a gene was composed of its coding sequence and of its “upstream region”, defined as the sequence located between the first base pair of the first promoter of the coding sequence and the first base pair of the Shine-Dalgarno-start signal for translation initiation. Figure 3 shows an example of a small gene family. The topology of the tree indicates the dynamics of gene duplications and losses. Because the exact timing of events is known, the branch lengths represent the real time elapsed, in number of generations. The branches are annotated with the mutational events that affected either the coding sequences or their upstream regions. These events can either be local mutations or chromosomal rearrangements. Indeed, aevol, along with the “ARN” model (Banzhaf, 2003) and potentially the model of early metabolism by Ullrich et al. (2011), is one of the few digital genetics models in which the breakpoints of chromosomal rearrangements are not constrained to intergenic regions. They operate at the sequence level rather than at the gene level and are thus blind to the genic/intergenic status of the sequences they disrupt. As a consequence, they can modify gene content and gene order, but also generate variability in gene sequences.

[Figure 3 (tree not reproduced). Annotated gene tree with a scale bar of 200 generations: de novo gene creation by a rearrangement at t=214, duplications at t=240, 247, 297, 379, 407 and 1,425, and gene losses by deletion, point mutation, inversion or translocation, the last one at t=1,904.]

Figure 3: Example of a small gene family. This is the fourth gene family of the third run. It was born at t = 214 and went extinct at t = 1,904. Squares indicate duplications, crosses indicate deletions, stars indicate inversions or translocations, and circles indicate point mutations, small insertions or small deletions. Gray events were neutral, while red events were deleterious and green events were beneficial. The topology of the tree was drawn with NJPlot (Perriere and Gouy, 1996).

Number of gene families

In previous studies with aevol, which mostly relied on the time series of gene number as in Figure 2, we thought that most evolved genes ultimately descended from the initial one(s). Indeed, because simulations started with only one or two genes and because several initiation signals are necessary for a sequence to be coding, one would expect that most genes would be created by a duplication-divergence process and that each run would thus contain one or two gene families only. There were actually on average 351.2 ± 158.3 gene families per evolutionary run, contrary to what we expected. This is not an artifact due to a saturation of the phylogenetic signal, because gene family identification is here based on the exact knowledge of all events and not on sequence similarity, and is thus insensitive to such a saturation. Thus, de novo gene creation was not rare. On average, in a run, the gene families of the initialization represented only 0.4% of all families and 10.8% of the final genes. This fits with a recent analysis of proto-gene candidates in the genome of the yeast S. cerevisiae, which suggested that “de novo gene birth may be more prevalent than sporadic gene duplication” (Carvunis et al., 2012). In our synthetic dataset, 55% of de novo gene creations were due to a local mutation and around 44% were due to a chromosomal rearrangement.

The large variation across runs in the number of families stems from a bimodal distribution, with seven runs centered around 281 families (hereafter called group A) while three other runs (called group B) are centered around 632 families. This variation is due to the initial phase of genome expansion, since on average 61% of the gene families in these three runs were extinct before generation 10,000.

At the end of the runs (t = 100,000 generations), on average 73.4 ± 6.2 gene families were still active in each run, meaning that they had at least one remaining gene. Thus, the families that would be accessible for classical phylogenetic inference would represent only 21% of the gene families that played a role in the evolutionary history of the evolved populations.

Size of gene families

What is usually called the family size is the number of non-extinct leaves at the time of observation. Here, the mean family size at t = 100,000 was 1.36 ± 0.1 non-extinct leaves, meaning that each family had on average 1.36 sibling genes (paralogs) in the evolved genome. In real genomes, gene family size is known to follow a power-law distribution (Huynen and van Nimwegen, 1998), with a vast majority of very small gene families and a few very large families. As shown by Figure 4, the family sizes obtained here do not span enough orders of magnitude to conclude that they follow a power-law distribution. However, in all runs, the vast majority of families had size 1, while only one or two families had a size larger than 4.

[Figure 4 (plot not reproduced). Vertical axis: number of families (log scale, 1 to 100). Horizontal axis: family size at t=100,000, in bins 1, 2..3, 4..7, 8..15, 16..31.]

Figure 4: Distribution of gene family size at t = 100,000. The different symbols correspond to the different repetitions and the black curve is the mean frequency over the ten repetitions. Following Huynen and van Nimwegen (1998), family sizes were binned exponentially and both axes are logarithmic. Missing symbols correspond to a frequency of 0.

Rates of gene duplication and loss

When taking all families of a run into account, a gene gain by duplication occurred on average every 212 generations in group A, and every 5.5 generations in group B. A gene loss occurred every 158 generations on average in group A, and every 5.4 generations in group B (63% of gene losses happened by the complete deletion of the gene, 16% were due to another chromosomal rearrangement and 21% were due to a local mutation). However, these rates of gene gains and losses are somewhat misleading because 90% of the gene duplications and gene losses occurred before generation 5,500. Hence, during the initial phase of genome expansion, the gene content is extremely dynamic. Besides, duplications and deletions generally encompass several neighboring genes, thereby creating strong correlations across families and inside different subtrees of a family. For example, in the first run, there were 604 gene duplications but they were concentrated on 85 different generations only.

Evolutionary rates of gene sequences

On each branch of each gene family, we counted and classified the events that modified the gene sequence without killing it, and divided these counts by the branch length in number of generations. Those rates per gene per generation were averaged over all branches of all gene families of all runs to produce Figure 5A. It shows that the coding sequences underwent more changes than the upstream region. This is an expected result given that (in the model) changes in the coding sequences can change the phenotypic traits to which the gene contributes, whereas changes in the promoter can just modulate the level of a gene contribution. Both local mutations and rearrangements modified gene sequences, but rearrangements were more numerous than local mutations in branches shorter than 290 generations, which represent 90% of the dataset. Thus, the mean rate of rearrangements over all branches turns out to be higher than the mean rate of local mutations. Beneficial mutations were also more frequent than neutral events, which is expected under a directional selection setting, but also depends on the fact that the artificial genetic code is not redundant. Neutral mutations can happen between the promoter and the start signal, but it is a rather small mutational target. Neutral mutations can also happen in intergenic regions, but those were not monitored here.

Figure 5B shows the overall normalized variation of each rate across all branches of all gene families of all runs. The indicator with the lowest normalized variation is the rate of all events that affect the gene sequence, regardless of whether they are neutral or beneficial, local mutation or rearrangement. It cannot, however, be chosen as a molecular clock, because it includes non-neutral events, whose count would be affected by the strength and type of selection. A good candidate for a molecular clock should thus both minimize its variation across branches and trees, and count neutral events only, in order to be robust to the selection regime. According to this criterion, the rate of all neutral events affecting the gene sequence, including rearrangements, would make a slightly better molecular clock than the rate of neutral local mutations only (Figure 5B, orange bars).
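The rate and dispersion computations described in this section can be sketched as follows. This is only an illustration: the function names and the toy branch data are hypothetical, branches are assumed to be given as (event count, branch length in generations) pairs, and the use of the sample standard deviation is an assumption, since the text only defines the indicator as 100 × standard deviation / mean.

```python
from statistics import mean, stdev


def branch_rates(branches):
    """Per-gene, per-generation rates for (event_count, branch_length) pairs.
    Branches of length 0 are ignored to avoid a division by zero."""
    return [count / length for count, length in branches if length > 0]


def relative_std(rates):
    """Normalized dispersion, 100 * standard deviation / mean.
    The sample standard deviation is an assumption; at least two rates and a
    non-zero mean are required."""
    return 100.0 * stdev(rates) / mean(rates)


def best_clock(neutral_indicators):
    """Among neutral-only indicators (name -> list of branches), return the
    name whose rate varies least across branches."""
    return min(
        neutral_indicators,
        key=lambda name: relative_std(branch_rates(neutral_indicators[name])),
    )


# Toy data (hypothetical counts, not taken from the paper).
toy = {
    "neutral local mutations": [(1, 100), (3, 100)],
    "all neutral events": [(2, 100), (3, 100)],
}
chosen = best_clock(toy)
```

Restricting `best_clock` to neutral indicators mirrors the criterion stated above: a clock candidate should both count neutral events only and minimize its variation across branches.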
Analyses of gene families in real genomes of fungi, insects, and mammals have revealed a negative correlation between the age of the family and the evolutionary rate of its members (Capra et al., 2013). As shown by Figure 5C, such a negative correlation is clear in our synthetic data if all gene families are considered ($r = -0.78$, p-value $< 2 \times 10^{-16}$). However, if one restricts the analysis to the observable families in the final evolved genomes, the correlation becomes positive but weaker ($r = +0.23$, p-value $\sim 3 \times 10^{-10}$). In the final genomes accessible to phylogenetic analyses, no trace remains from the fast evolving gene families that supported the initial (yet important) steps of the adaptation to the novel environment. Note, however, that even these final observable families evolve relatively fast and that we are actually simulating only the subset of non-essential genes that confer a selective advantage in the novel environment. In contrast, in real datasets reviewed in (Capra et al., 2013), the core genes for e.g. replication and gene expression would be included. This could explain the differences in the correlation patterns between the synthetic and the real data.

[Figure 5 (plots not reproduced). Panel A: mean rate per generation (axis from 0.0000 to 0.0025). Panel B: normalized dispersion. Both panels list the same indicators, broken down by effect (neutral, beneficial), event type (local mutations, rearrangements) and location (coding sequence, upstream region). Panel C: rate of sequence evolution, including rearrangements (mean on all branches of the family), against the time span of the gene family, both on logarithmic axes.]

Figure 5: A. Mean evolutionary rate, per gene per generation, for each type of event (average over all branches of all gene families of all runs). B. Relative standard variation (100 × standard deviation / mean) of each indicator across all branches of all gene families of all runs. Orange bars correspond to neutral indicators, candidates for a molecular clock. C. Correlation between the logarithm of family time span and the logarithm of the mean evolutionary rate in the family (black = all gene families, green = active families at t = 100,000). The mean evolutionary rate of a family was computed as the average, over all its branches, of the per-generation rate of events that modify the gene without killing it. This is the indicator called “All events” in panels A and B. It includes both local mutations and chromosomal rearrangements.

Conclusion
This study of synthetic gene families revealed that, upon adaptation to a new environment, (i) there was a high turnover of gene families and extinct families showed patterns different from the final, observable ones, both in terms of dynamics of gene gains and losses and in terms of gene sequence evolution, (ii) gene sequence evolution occurred through both local mutations and chromosomal rearrangements, and (iii) incorporating chromosomal rearrangements in the evolutionary rate of gene sequences would slightly improve the accuracy of the molecular clock. Although some of the results may depend on simplifications of the model, such as the absence of redundancy in the artificial genetic code, the study demonstrates how digital genetics can explore data inaccessible to classical phylogenetic methods. With refined models developed in close collaboration with phylogeneticists, digital genetics could prompt a reassessment of the biases and limitations in the studies of evolutionary dynamics of genes.

Acknowledgements
This research program was supported by the EvoEvo FP7 European project, by the PEPII program of the French CNRS and by the Rhône-Alpes Institute for Complex Systems (IXXI). We thank Eric Tannier and Guillaume Beslon for the inspiring discussions and comments.

References
Adami, C. (2006). Digital genetics: unravelling the genetic basis of evolution. Nat. Rev. Genet., 7:109–118.
Akerborg, O., Sennblad, B., Arvestad, L., and Lagergren, J. (2009). Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proc Natl Acad Sci USA, 106(14):5714–5719.
Arvestad, L., Berglund, A.-C., Lagergren, J., and Sennblad, B. (2004). Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. In Proc. RECOMB 2004, pages 326–335. ACM Press, New York.
Banzhaf, W. (2003). On the dynamics of an artificial regulatory network. In Banzhaf, W., Ziegler, J., Christaller, T., Dittrich, P., and Kim, J., editors, Advances in Artificial Life, volume 2801 of Lecture Notes in Computer Science, pages 217–227. Springer Berlin Heidelberg.
Batut, B., Parsons, D., Fischer, S., Beslon, G., and Knibbe, C. (2013). In silico experimental evolution: a tool to test evolutionary scenarios. BMC Bioinformatics, 14(Suppl 15):S11.
Beiko, R. G. and Charlebois, R. L. (2007). A simulation test bed for hypotheses of genome evolution. Bioinformatics, 23(7):825–831.
Beslon, G., Parsons, D. P., Sanchez-Dehesa, Y., Pena, J. M., and Knibbe, C. (2010). Scaling laws in bacterial genomes: A side-effect of selection of mutational robustness. BioSystems, 102(1):32–40.
Blount, Z. D., Barrick, J. E., Davidson, C. J., and Lenski, R. E. (2012). Genomic analysis of a key innovation in an experimental Escherichia coli population. Nature, 489(7417):513–518.
Boussau, B., Szollosi, G. J., Duret, L., Gouy, M., Tannier, E., and Daubin, V. (2013). Genome-scale coestimation of species and gene trees. Genome Research, 23(2):323–330.
Capra, J. A., Stolzer, M., Durand, D., and Pollard, K. S. (2013). How old is my gene? Trends in Genetics, 29(11):659–668.
Carvunis, A.-R., Rolland, T., Wapinski, I., Calderwood, M. A., Yildirim, M. A., Simonis, N., Charloteaux, B., Hidalgo, C. A., Barbette, J., Santhanam, B., Brar, G. A., Weissman, J. S., Regev, A., Thierry-Mieg, N., Cusick, M. E., and Vidal, M. (2012). Proto-genes and de novo gene birth. Nature, 487(7407):370–374.
Dalquen, D. A., Anisimova, M., Gonnet, G. H., and Dessimoz, C. (2012). ALF–A Simulation Framework for Genome Evolution. Molecular Biology and Evolution, 29(4):1115–1123.
Enright, A. J., Van Dongen, S., and Ouzounis, C. A. (2002). An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research, 30(7):1575–1584.
Frenoy, A., Taddei, F., and Misevic, D. (2013). Genetic architecture promotes the evolution and maintenance of cooperation. PLoS Computational Biology, 9(11):e1003339.
Geer, L. Y., Domrachev, M., Lipman, D. J., and Bryant, S. H. (2002). CDART: protein homology by domain architecture. Genome Research, 12(10):1619–1623.
Hagstrom, G. I., Hang, D. H., Ofria, C., and Torng, E. (2004). Using Avida to test the effects of natural selection on phylogenetic reconstruction methods. Artificial Life, 10(2):157–166.
Hang, D., Torng, E., Ofria, C., and Schmidt, T. M. (2007). The effect of natural selection on the performance of maximum parsimony. BMC Evolutionary Biology, 7(1):94.
Hoban, S., Bertorelle, G., and Gaggiotti, O. E. (2012). Computer simulations: tools for population and evolutionary genetics. Nature Reviews Genetics, 13(2):110–122.
Huynen, M. A. and van Nimwegen, E. (1998). The frequency distribution of gene family sizes in complete genomes. Molecular Biology and Evolution, 15(5):583–589.
Jachiet, P. A., Pogorelcnik, R., Berry, A., Lopez, P., and Bapteste, E. (2013). MosaicFinder: identification of fused gene families in sequence similarity networks. Bioinformatics, 29(7):837–844.
Knibbe, C., Mazet, O., Chaudier, F., Fayard, J.-M., and Beslon, G. (2007). Evolutionary coupling between the deleteriousness of gene mutations and the amount of non-coding sequences. J. Theor. Biol., 244(4):621–630.
Lin, K., Zhu, L., and Zhang, D. Y. (2006). An initial strategy for comparing proteins at the domain architecture level. Bioinformatics, 22(17):2081–2086.
Liò, P. and Goldman, N. (1998). Models of molecular evolution and phylogeny. Genome Research, 8(12):1233–1244.
Maharjan, R. P., Gaff, J. l., Plucain, J., Schliep, M., Wang, L., Feng, L., Tenaillon, O., Ferenci, T., and Schneider, D. (2013). A case of adaptation through a mutation in a tandem duplication during experimental evolution in Escherichia coli. BMC Genomics, 14(1):1–1.
Nilsson, A. I., Koskiniemi, S., Eriksson, S., Kugelberg, E., Hinton, J. C. D., and Andersson, D. I. (2005). Bacterial genome size reduction by experimental evolution. Proc Natl Acad Sci USA, 102(34):12112–12116.
Parsons, D. P., Knibbe, C., and Beslon, G. (2010). Importance of the rearrangement rates on the organization of transcription. In Proceedings of Artificial Life XII, pages 479–486.
Payen, C., Di Rienzi, S. C., Ong, G. T., Pogachar, J. L., Sanchez, J. C., Sunshine, A. B., Raghuraman, M. K., Brewer, B. J., and Dunham, M. J. (2014). The dynamics of diverse segmental amplifications in populations of Saccharomyces cerevisiae adapting to strong selection. G3: Genes|Genomes|Genetics, 4(3):399–409.
Perriere, G. and Gouy, M. (1996). WWW-query: An on-line retrieval system for biological sequence banks. Biochimie, 78(5):364–369.
Rasmussen, M. D. and Kellis, M. (2012). Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Research, 22(4):755–765.
Song, N., Joseph, J. M., Davis, G. B., and Durand, D. (2008). Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins. PLoS Computational Biology, 4(5):e1000063.
Tenaillon, O., Rodriguez-Verdugo, A., Gaut, R. L., McDonald, P., Bennett, A. F., Long, A. D., and Gaut, B. S. (2012). The Molecular Diversity of Adaptive Convergence. Science, 335(6067):457–461.
Ullrich, A., Rohrschneider, M., Scheuermann, G., Stadler, P. F., and Flamm, C. (2011). In silico evolution of early metabolism. Artificial Life, 17(2):87–108.
Vilella, A. J., Severin, J., Ureta-Vidal, A., Heng, L., Durbin, R., and Birney, E. (2008). EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Research, 19(2):327–335.

ALIFE 14: Proceedings of the Fourteenth International Conference on the Synthesis and Simulation of Living Systems
