Review TRENDS in Microbiology Vol.15 No.3

Comparative genomics and the evolution of prokaryotes

Sophie Abby and Vincent Daubin

Université de Lyon, Université Lyon 1, Centre National de la Recherche Scientifique, UMR5558, Laboratoire de Biométrie et Biologie évolutive, Villeurbanne, F-69622 cedex, France

Although biologists have long recognized the importance of studying evolution to understand the organization of living organisms, only with the development of genomics have evolutionary studies become part of their routine toolkit. Placing genomes into an evolutionary framework has proved useful for understanding the functioning of organisms. It has also substantially increased understanding of the processes by which genomes evolve and led to a re-evaluation of our representation of the diversity and the history of life. In this review, we present some of the most important recent advances and promising leads in the field of microbial evolutionary genomics.

Genomics: a new era for molecular evolution

The science of molecular evolution is in its golden age. The dawning era of genomics provides invaluable information for studying the hereditary material in both its micro- and macro-evolution (see Glossary). Especially in prokaryotes, where the number of sequenced genomes will soon be counted in thousands, understanding how contemporary organisms evolved their range of functions is challenging and fascinating. Prokaryotes certainly provide an interesting opportunity for studying the mechanisms of evolution: they harbor a previously unsuspected diversity even within species and populations [1], they are found in small [2] to very large population sizes [3], they can survive or prosper in most inhospitable environments from the inside of eukaryotic cells [2] to hot springs [4] or spaceships [5], and they frequently acquire and use genetic material from distantly related organisms.
A large majority of prokaryotic species have yet to be sampled but the task of making sense of the exponentially growing amount of available data is already enormous. However, it has also become evident that the annotation of a genome sequence greatly benefits from comparative genomic analyses. The algorithms used for predicting open reading frames (ORFs) are essentially based on the search for start and stop codons along the genome and, although these algorithms are usually efficient for annotating bacterial genomes, the best way to confirm the functional status of ORFs is still to study their degree of conservation among species. Although experimental validation is more specific in revealing gene functions, its application is much more complex. In addition, mutants for many evolutionarily conserved genes have no detectable phenotypes [6], and these genes would therefore be considered as unimportant based solely on this approach. Confirmation of functional status by comparative analysis is particularly crucial for small ORFs, which can occur by chance [7]. In addition, the prediction of coding sequences is relatively straightforward, but systematically identifying other functional regions such as small non-coding (NC) RNAs can be impractical without the tools of comparative genomics [8].

This review presents recent advances in our understanding of molecular evolution of prokaryotes that arose from comparative genomic approaches. Comparisons across genomes from distantly related organisms have enabled the identification of constraints that determine the organization of the bacterial chromosome and the identification of mechanisms responsible for their incomparable diversity and adaptability. Evolutionary biologists are beginning to uncover the processes of gene birth and death and how newly acquired functions integrate into an existing complex cellular machinery.

Corresponding author: Daubin, V. ([email protected]). Available online 7 February 2007.
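The start-and-stop-codon search described above for ORF prediction can be sketched in a few lines. This is only an illustration, not the algorithm of any particular annotation tool: the function name `find_orfs`, the `min_codons` threshold and the restriction to ATG starts on the forward strand are all simplifying assumptions made here (bacterial genes also use alternative start codons and lie on both strands).

```python
# Minimal ORF scan: from a start codon to the first in-frame stop codon.
# Assumptions (not from the review): forward strand only, ATG as the only start.
START_CODONS = {"ATG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=30):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; the ORF includes its stop codon.
    min_codons counts every codon of the ORF, stop codon included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first start codon of the current ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i + 3 - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    toy = "CCATGAAATTTTAGGG"  # hypothetical sequence with one short ORF
    print(find_orfs(toy, min_codons=3))
```

Lowering `min_codons` returns many more candidates on a real genome, which illustrates why short ORFs that occur by chance need confirmation by conservation among species.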
Universal features and diversity of prokaryotic genomes

Comparative studies of genomes have revealed that bacterial chromosomes are under selective pressures that have deeply shaped their organization. The processes of replication, transcription and the regulation of gene expression all impact how genes are arranged along the genome. It has long been known that the asymmetric manner in which the bacterial chromosome is replicated, with a leading and a lagging strand, is correlated with many evolutionary features, such as differential mutational bias between the two strands [9] and location of essential genes [10] (Figure 1). Interestingly, the intensity of these biases shows a strong phylogenetic inertia, that is, a good correlation with the tree of species. This could be related to the nature of the DNA polymerase complex. Indeed, the group of Firmicutes, which has two different DNA polymerase α-subunits, exhibits a much stronger bias than species that have only one DNA polymerase [10]. It was previously thought that highly expressed genes underwent a selective pressure to be co-oriented with the replication fork to avoid frequent collisions between the DNA polymerase and RNA polymerase; however, Rocha and Danchin [11] have shown that this effect is more visible for essential genes, whether highly expressed or not. Thus, they proposed that the deleterious effect of the head-on collisions of polymerases lies more in the production of truncated transcripts and, consequently, non-functional proteins than in the disruption of the replication complex.

Glossary
Coalescence: a coalescence event for two lineages is the first event of common ancestry of the two lineages.
Core genome: set of genes shared by all genomes in a species or taxa.
Macroevolution: any evolutionary change above the level of species.
Microevolution: any evolutionary change under the level of species.
Neo-functionalization: the process by which a gene acquires a new function, generally after duplication.
ORFan (orphan): gene for which no homolog is found in current databases.
Orthologs: genes with a relationship that arose from a speciation event.
Pan genome: the union of all the genes that can be found in a species.
Paralogy: relationships between two duplicated genes.
Phylogenetic inertia: influence of evolutionary history on the conservation of a character.
Pseudogene: relic of an ancient functional gene that is no longer functional.
Sub-functionalization: the process by which duplicated genes come to fulfill complementary functions that were all encoded by ancestral genes.

0966-842X/$ – see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.tim.2007.01.007

Figure 1. Replication constraints on bacterial chromosomal organization. Two kinds of biases are detected along the bacterial chromosome: asymmetries owing to the existence of a leading and a lagging strand, and biases related to the proximity of the origin and terminus of replication (Ori and Ter). Essential genes are represented by red arrows and non-essential genes are shown in green. The thickness of an arrow is proportional to the expression rate of the gene it represents. Essential genes are preferentially located on the leading strand and highly expressed genes, especially those related to transcription and translation, tend to be closer to the origin of replication in fast-growing bacteria (see main text). The evolutionary rate and the G + C content (gray gradients) are respectively increasing and decreasing with distance to the origin.

The replication of the bacterial chromosome is orientated from the origin (Ori) to the terminus (Ter).
This is also correlated with several evolutionary features (Figure 1): in many genomes, genes near the terminus of replication tend to have lower content in G + C nucleotides and to exhibit higher rates of evolution [12]. Also, as a result of the replication process, genes located close to the Ori can be significantly amplified in dividing cells, in comparison with those closer to the terminus. Couturier and Rocha [13] have recently shown that, although doubling time and replication-associated gene dosage are fast-evolving features, the organization of the genomes of fast-growing bacteria is deeply impacted by the necessity to overexpress genes related to transcription and translation. Not only are these genes over-represented in the region of the origin, but the genomes of fast-dividing bacteria show evidence that rearrangements that would disrupt this association are counter-selected [13].

In spite of these common principles of genome organization, comparative genomics has revealed a previously unexpected degree of diversity among prokaryotic genomes. One of the most striking examples of this diversity is the comparison of gene contents within and between species. All forms of life seem to share only a handful of genes, 60 according to the review by Koonin [14], and these are mainly dedicated to translation. The genes for other fundamental functions, such as DNA replication, transcription or basic metabolism seem to be more sporadically spread in the tree of life. More surprisingly, this diversity of genome content can be seen at every phylogenetic scale. Lerat et al. [15] have estimated that the core genome of 13 γ-proteobacteria contains <300 genes and, although all Escherichia coli and Shigella genomes sequenced to date have >4000 protein coding genes, these genomes share <3000 genes [16,17] (Box 1).
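Once genes have been grouped into families, the core genome and the pan genome (see Glossary) reduce to a set intersection and a set union. The sketch below assumes that this grouping has already been done, which is the difficult step in practice; the strain names, gene-family identifiers and function names are hypothetical.

```python
def core_and_pan(genomes):
    """Return (core, pan) for a non-empty dict: strain name -> set of gene families."""
    gene_sets = list(genomes.values())
    core = set.intersection(*gene_sets)  # families present in every genome
    pan = set.union(*gene_sets)          # families present in at least one genome
    return core, pan


def new_genes_per_genome(genomes, order):
    """Count the gene families first seen as each genome is added in `order`."""
    seen = set()
    counts = []
    for name in order:
        counts.append(len(genomes[name] - seen))
        seen |= genomes[name]
    return counts


if __name__ == "__main__":
    # Hypothetical toy data: three strains, five gene families.
    strains = {
        "strain_A": {"g1", "g2", "g3"},
        "strain_B": {"g1", "g2", "g4"},
        "strain_C": {"g1", "g5"},
    }
    core, pan = core_and_pan(strains)
    print(sorted(core), sorted(pan))
    print(new_genes_per_genome(strains, ["strain_A", "strain_B", "strain_C"]))
```

The second function tracks how many previously unseen families each additional genome contributes. A count that stays well above zero as genomes are added is what an open pan genome would look like in such data, although the result depends on the order in which genomes are added.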
This variability has raised the question of how to define the genome of a species, and the concept of ‘pan genome’ was proposed as the sum of all genes found in a species. Tettelin et al. [18] have examined the question of how many genomes are needed to describe fully the pan genome of a species. Their results showed that in Bacillus anthracis, the pan genome was found to be fully described with only four genomes (a probable testimony of the recent emergence of this species); however, in group B Streptococcus, the pan genome was ‘open’, meaning that the number of new genes contributed by every new genome sequence was expected to be 30 whatever the number of genomes already present in the comparison (Box 1). The study of seven strains of E. coli also suggested an open pan genome for this species, but with a much more imposing pan genome as each sequenced genome was reported to contribute >440 genes [17]. These results suggest that the gene pool available in the microbial world is far larger than previously thought, and that the pan genome of a species is typically several times bigger than its core genome.

Box 1. Core and pan genomes of prokaryotes

The diversity of genome repertoires is illustrated by two complementary concepts: the core genome and the pan genome. The number of genes per genome varies widely between and within kingdoms (Figure I). The ranges of genome sizes are given under kingdom names.

The core genome is defined as the set of genes that is shared by all members of a monophyletic group. It only represents a minimal estimate of the gene repertoire of the common ancestor. The small size of core genomes and a comparison with genome sizes of contemporary organisms suggest that evolution has repeatedly produced various ways of accomplishing the same tasks. Estimates of the size of various core genomes are shown in Figure I at the basis of their group by solid red and blue circles. Numbers of genes in core genomes are extracted from HOGENOM database (http://pbil.univ-lyon1.fr/databases/hogenom.html), except for Streptococcus agalactiae [18], Escherichia coli [17] and Haloquadratum walsbyi [19]. The estimate of the core genome not only depends on the phylogenetic depth of the group considered but also on the number of genomes available for comparison (Bacillales, 14 genomes; Lactobacillales, 10 genomes; γ-proteobacteria, 38 genomes).

The pan genome is defined as the union of all the genes that can be found in a species. Gray circles represent the pan genomes of three species of bacteria and one archaeon. Tettelin et al. [18] showed that the pan genome of the clonal organism Bacillus anthracis can be described with only four of the eight genomes sequenced to date and the authors therefore proposed that the pan genome of this species is closed (solid circle). By contrast, for E. coli and S. agalactiae, the pan genome is still significantly growing with every new genome sequenced. Tettelin et al. [18] showed that the number of genes of the pan genome is far from reaching a plateau, and argued that these pan genomes were open (dashed circles), which was also confirmed for E. coli [17]. The pan genome of the archaeon H. walsbyi was collected and evaluated by a metagenomics approach [19]. Because only one genome was entirely sequenced for this species, the core genome is not known but its size is necessarily smaller than the 2800 genes present in the complete genome sequence.

Figure I. Gene repertoires in the tree of life.

Based on the random sequencing of environmental samples, metagenomics studies have provided the first glimpse at the tremendous diversity of these genetic surroundings. In a recent paper, Legault et al.
[19] suggested that the pan genome of the square halophilic archaeon Haloquadratum walsbyi, analyzed in an environmental genomic assay, was at least twice as big as the genome of the sequenced strain. Furthermore, most of these additional sequences exhibited atypical GC content and were associated with insertion sequence (IS) elements and phage sequences, suggesting a role for horizontal gene transfer (HGT) in the maintenance of this accessory gene pool (see later). Many environments have been analyzed for their gene content, from the human distal gut, in which 16 novel bacterial phylotypes and 60 uncultured species within the 151 bacterial phylotypes analyzed were discovered [20], to the Sargasso Sea where more than a billion nonredundant base pairs were sequenced [21]. Recently, Edwards et al. [22] analyzed the metagenomes isolated from two sites of a deep mine and discovered a great diversity of species and significant differences in metabolic potential between these neighboring spots. Not only do these studies support the vision of an outstanding genetic diversity of microbes, but they also demonstrate the possibility of sequencing unculturable microorganisms and provide elements with which to compare the structure of ecological systems.

The evolution of gene repertoires

The mechanisms explaining this diversity of gene repertoires in genomes [i.e. the processes by which genes are gained and lost (Figure 2)] have been the subject of numerous studies. Before complete genomes from different strains of the same species or from closely related species were available, nonfunctional genes or pseudogenes were thought to be rare in bacteria. The first reports of a significant number of pseudogenes were in pathogens undergoing strong genome reduction such as Rickettsia prowazekii or Mycobacterium leprae, but free-living bacteria were believed to contain relatively few pseudogenes. The first release of the E.
coli MG1655 genome was reported to contain only one pseudogene. A recent approach based on comparisons of closely related genomes [23,24] has shown that this genome contains 100 genes that are >80% shorter than their orthologs in other E. coli strains, and are therefore likely to be pseudogenes. The most frequent causes of gene disruption are frameshifts and truncations but some recent pathogens such as Yersinia pestis [25,26] or Shigella flexneri [27,28] exhibit a high proportion of pseudogenes due to the introduction of IS elements, probably as a result of relaxed selection pressure owing to a recent bottleneck in their population size. These results have shown that pseudogenes are more abundant than previously thought in bacterial genomes but are subject to quick elimination once disrupted because only a small proportion of them are conserved long enough to be found in several strains.

Figure 2. The dynamics of genome repertoire. Bacterial genomes are dynamic entities that constantly gain (left; blue boxes) and lose genes (right; beige boxes). These modifications of gene repertoires arise by different mechanisms. First, bacterial genomes can acquire genetic material from other organisms, even distantly related ones. Horizontal gene transfers are evidenced by different types of approaches that generally identify distinct sets of genes. (i) Analysis of gene composition (GC%, codons) identifies mostly genes that are rarely found in other species. This generally precludes a confirmation of their foreign status by a phylogenetic analysis. However, a mapping of gene presence on a phylogenetic tree of complete genomes can confirm that they have been recently acquired in the genome [41]. By contrast, phylogenetic analysis can reveal HGT for genes that have wider phylogenetic distribution, and these genes only rarely show a striking difference in composition [36]. In this case, HGT can result in the addition of a completely new gene (i), the replacement of an existing gene (ii) or genetic redundancy (iii) if a homologous gene is already present in the recipient genome. Genetic redundancy can also arise from gene duplication and only phylogenetic analysis can distinguish between these two origins. Recent analyses have demonstrated that HGT participates significantly in the degree of redundancy in a bacterial genome [30]. Gene excision and formation of pseudogenes are the mechanisms for gene loss. Excision occurs when a gene is completely deleted from the genome, and pseudogene formation occurs when mutations (point mutation and/or insertion/deletion) accumulate, resulting in function loss. The loss of a gene is evidenced by the absence of the gene in the analyzed genome whereas it is present in related species (phylogenetic mapping). Pseudogenes can be identified by comparisons of closely related genomes [23].

These recurrent losses of genes and functions must be compensated by the acquisition of new genetic material. In eukaryotes, the evolution of new genes is thought to occur mainly through duplication followed by sub- or neo-functionalization of one or both resulting copies. But prokaryotes can integrate genes of diverse origin into their genomes through HGT, which is believed to have a crucial role in speciation and prokaryotic adaptation to new environments [29]. The question of how much duplication and HGT are contributing to genetic novelty has thus been investigated in bacteria. Using maximum likelihood tests to compare phylogenetic gene trees, Lerat et al. [30] have shown that a large proportion of the genes that are usually deemed as duplicates in bacterial genomes are more likely to be genes that have been acquired by HGT while they already had a homolog present in the recipient genome. Another study confirmed the dominant role of HGT over duplication in the evolution of the E.
coli metabolic network [31]. However, the relative role of HGT and duplication might vary significantly among species: recent studies of two large bacterial genomes, Myxococcus xanthus [32] and Burkholderia xenovorans LB400 [33] (9.14 Mbp and 9.73 Mbp, respectively), estimated that HGT and duplication contributed in equal proportions to their gene repertoires (15–20%). This amount of duplicates is exceptional in bacteria and has been proposed to be correlated with specific ecological or behavioral needs, such as cellular communication for the social M. xanthus and the ability to adapt to different nutrient sources for B. xenovorans. Interestingly, the contribution of HGT and duplications to genome content does not only vary among bacterial groups but also within taxa; for example, other strains of the species B. xenovorans do not harbor such an amount of redundancy in their genomes [33]. The number of foreign genes present in the genome of E. coli was estimated to be >10% before the era of comparative genomics, when Médigue et al. [34] analyzed the codon composition of about a third of its genes. Later, similar analyses that searched for genes having atypical features in a genome revealed that the number of HGT events varies drastically among species, from zero to >20% [29,35]. However, confirming the status of these foreign genes by independent approaches is difficult [36]. It might be significant that among thousands of HGT detected using nucleotide compositions, the example chosen by Nakamura et al. [35] to illustrate a confirmation by phylogenetic analysis was later pointed out to have been deliberately introduced in Neisseria meningitidis by genetic modification to reduce virulence [37,38]. 
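The compositional approach mentioned above, searching for genes with atypical features in a genome, can be illustrated with a deliberately crude sketch. It flags genes whose G + C content deviates strongly from the genome-wide mean. The z-score cutoff, the function names and the toy sequences are assumptions made here for illustration; published methods rely on richer statistics such as codon usage, and a compositional outlier is not by itself proof of horizontal transfer.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C nucleotides in a non-empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_genes(genes, z_cutoff=2.0):
    """Return names of genes whose GC content is an outlier.

    genes: dict mapping gene name -> nucleotide sequence.
    z_cutoff: hypothetical threshold on the absolute z-score.
    """
    gc = {name: gc_content(seq) for name, seq in genes.items()}
    values = list(gc.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all genes have identical composition
    return sorted(name for name, v in gc.items() if abs(v - mu) / sigma > z_cutoff)


if __name__ == "__main__":
    # Hypothetical toy genome: nine genes at 50% GC and one AT-rich gene.
    toy = {f"native_{i}": "ATGC" * 5 for i in range(9)}
    toy["candidate"] = "ATAT" * 5
    print(atypical_genes(toy))
```

Because recently acquired genes tend to be short and compositional signals are noisy, candidates produced this way would still need the independent confirmation discussed in the text.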
Nevertheless, most of these genes are probably genuine HGT, as demonstrated by comparisons of genome content in a phylogenetic framework [37,39,40]: the distribution of genes with atypical codon composition on a species phylogeny strongly suggests that most of them are transmitted horizontally. The origin of these genes can rarely be confirmed by phylogenetic analyses because most of them have no known homologs in databases. The fact that HGT are strongly enriched in these orphan genes or ‘ORFans’ again points at the ‘open’ pan genome and the tremendous diversity of the available pool of genes. A solution to the dilemma of this infinite pool of available proteins has been proposed by Daubin and Ochman [41,42]: many ORFans show characteristics that are strikingly similar to bacteriophage- and plasmid-specific genes, and could be continuously generated there through their exceptionally high mutation rates and opportunities for heterologous recombination. Although most of these genes are probably deleterious or useless for the transduced host, evidence exists that such genes can prove useful for their cellular recipient and can even become essential and, ultimately, become incorporated into the core genome of a species [41,42]. This hypothesis has been tested by searching for homologs of ORFans in databases of bacteriophage genes [41,43]. In their recent study, Yin and Fischer [43] found that only a few genomes show evidence for ORFans having more homologs in phage genomes than other genes. However, their study showed that databases of viral genes are strongly biased toward bacteriophages associated with γ-proteobacteria and Firmicutes, and that ORFans from both of these groups show significantly higher homology to bacteriophage than other genes.
Although the role of bacteriophage in generating ORFans seems significant in these groups, a more representative sample of the diversity of phages would be necessary to generalize this result to other bacteria. Although ORFans and uncharacterized genes explain a significant part of the diversity of gene repertoires in bacteria, there is also strong evidence that distantly related bacteria exchange genes and that these transfers have a key role in the acquisition of new capabilities and the adaptation to new environments [29,44]. Such genes are usually less well detected by codon composition analyses and are more readily found by incongruent trees or sporadic occurrence in the phylogeny of bacteria [36].

HGT and the evolution of gene networks

One of the important questions raised by HGT is how a newly acquired gene fits into the complex cellular network of the recipient organism. Based on an analysis of congruence of gene phylogenies, Jain et al. [45] proposed the ‘complexity hypothesis’ that genes might have different probabilities of being transferred depending on how many interacting partners their products have in the cell. Most notably, genes involved in translation and transcription, most of them part of protein complexes, were found to show fewer indications of transfers. More recently, Pal et al. [31,46] analyzed the metabolic network of E. coli to study the influence of the metabolic network on the probability of gene transfers. The success of a HGT was found to depend on the pathway it affected, with an HGT that intervened in a peripheral pathway [46] or having physiologically interacting partners already present in the genome [31] being more likely to be fixed. According to Pal et al. [46], prokaryotic gene networks evolve by continuous addition of peripheral functions that are more directly involved in interacting with the environment.
This view contrasts with the model proposed by Teichmann and Madan Babu [47] for the evolution of regulatory networks in which 45% of the regulatory interactions in E. coli arose by duplication and inheritance of interaction. However, these views are not necessarily incompatible because Lerat et al. [30] showed that many genes traditionally identified as duplicates in bacterial genomes might have arisen from HGT of a gene that possessed a homolog in the recipient genome, a possibility not considered in the study by Teichmann and Madan Babu [47].

Starting from the gene content of the E. coli regulatory network [48,49], Hershberg and Margalit [50] investigated the conservation of transcription factors (TFs) and their targets among γ-proteobacteria. They found that repressors co-occur with their targets, while activators can be lost independently of their targets. This suggests a differential evolving mechanism to turn off a regulatory pathway: the loss of TFs is sufficient in the case of positive regulation whereas in the case of negatively regulated pathways, the loss of a repressor can have strong negative effects by constitutively expressing the target function. Madan Babu et al. [51] and Lozada-Chavez et al. [52] did similar studies at a higher evolutionary scale – they analyzed 175 (bacterial and archaeal) and 110 (bacterial) genomes, respectively, and showed a lower conservation of TFs compared with their targets. A limitation of these studies is that the gene networks of only a few model organisms have been studied experimentally and networks are generally reconstructed by searching for homologous genes in other genomes. However, the inference of a protein function based on comparative analysis is not straightforward, especially when divergent organisms are considered because homologous genes can encode different functions [53,54].
In these analyses of the evolution of gene networks, homology of function is generally inferred solely from BLAST searches and the risk of assigning erroneous functions to proteins can be high. Because gene histories intermingle evolutionary events such as duplications, gene transfers, gene losses and speciation, the assessment of the type of homologous relationships among genes is a crucial point in comparative genomics.

Phylogenomics and the problem of HGT

Comparative genomics has raised the issue of HGT and the concept of species but it has also provided a large amount of new phylogenetic markers and stimulated the field of phylogenetics. Numerous phylogenomic methods were recently developed to use complete genome data (see Ref. [55] for a review) and were used to attempt to reconstruct the tree of life [56] or to test for the existence of a phylogenetic signal. With the finding of the extent of HGT, the Darwinian tree-like representation of relationships between species has been questioned by some authors asserting that HGT events are so ‘rampant’ that genes cannot be used as reliable phylogenetic markers. They propose a network of species [44], arguing that a signal of vertical inheritance cannot be unraveled from horizontal signals due to HGT. This idea is hotly controversial because several studies showed the existence of a predominant signal for an organismal phylogeny [29,30,57–59]. It was thought that using a large dataset to reconstruct phylogenetic trees would ensure that a powerful signal would emerge to resolve phylogenetic relationships, provided that HGT had been adequately identified [56,60]. However, a recent study showed that, even independently of HGT, population genetics predicts a great deal of incongruence among gene trees and that combining even large amounts of these data would not help.
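A simple textbook case shows how population genetics alone can produce incongruent gene trees. It is not the simulation model of the study discussed here, only the classical three-species coalescent result: if two sister species are separated from the third by an internal branch of length T in coalescent units, a gene tree disagrees with the species tree with probability (2/3)·exp(−T). The sketch below assumes a haploid population, so that T equals t generations divided by the effective population size N_e; the function and parameter names are hypothetical.

```python
import math


def prob_discordant(t_generations, n_e):
    """Three-species probability that a gene tree conflicts with the species tree.

    t_generations: length of the internal branch, in generations.
    n_e: effective population size of the ancestral lineage (haploid assumption,
         so one coalescent unit corresponds to n_e generations).
    """
    big_t = t_generations / n_e  # internal branch in coalescent units
    return (2.0 / 3.0) * math.exp(-big_t)


if __name__ == "__main__":
    # Same branch length in generations, increasingly large populations.
    for n_e in (1e5, 1e7, 1e9):
        print(f"N_e = {n_e:.0e}: P(discordant) = {prob_discordant(1e7, n_e):.3f}")
```

For a fixed branch length in generations, the probability rises toward 2/3 as N_e grows, which is one way to see why large populations can make this effect important.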
Degnan and Rosenberg [61] have simulated the evolution of genes using a coalescence model and shown that, especially in the case of deep trees, gene trees can often conflict with species trees even without exchanges among lineages. This effect would probably be particularly important in prokaryotes because the time of coalescence can be greater with larger population sizes. Therefore, the conflict observed among gene trees might not be only the result of HGT, phylogenetic artifacts or hidden paralogy, but also of a genuine vertical descent in which polymorphic alleles cohabit for a long time in populations. Future attempts to assess the degree of incongruence among gene trees and to reconstruct the tree of life will have to take into account this possible effect.

Concluding remarks and future perspectives

The amount of data generated by genome projects stimulates many fields of biology and has a deep impact on our vision of the evolution and the organization of life. Bacteria have been found to be far more diverse, complex and variable than ever suspected but, although their genomes exhibit striking differences in gene contents, they show an organization based on the same principles. The necessity to replicate and express their genome simultaneously imposes constraints on gene dosage and arrangement. How this organization is exploited for new adaptations in an ever-changing genome constantly impacted by HGT has yet to be understood. However, comparative genomics has already enabled the identification of some of the mechanisms that determine the acquisition of new genes and functions, and how they integrate in the cellular network. The development of metagenomics will enable a better description of the genetic environment of organisms and an understanding of the possible functional innovations that can arise from HGT.
The role of HGT seems to be crucial but one should not consider that gene exchanges have been so profound as to preclude the reconstruction of the history of life, in the sense of understanding how genomes have evolved to what they are. More integrative approaches combining information from species phylogenies, gene histories, ecology and cellular networks will be necessary to tell the chronicles of contemporary genomes. But the comparative analysis of genomes of different domains of life, cellular and viral organisms already suggests that this story is a tale of invasions, exchanges and conflicts turning into cooperation.

Acknowledgements

We would like to thank Daniel Kahn, Sylvain Mousset, Bastien Boussau and five anonymous reviewers for their helpful comments on the manuscript. S.A. is the recipient of a fellowship from the Ministère de l’Education Nationale de la Recherche et de la Technologie.

References

1 Binnewies, T.T. et al. (2006) Ten years of bacterial genome sequencing: comparative-genomics-based discoveries. Funct. Integr. Genomics 6, 165–185
2 Moran, N.A. (2002) Microbial minimalism: genome reduction in bacterial pathogens. Cell 108, 583–586
3 Lynch, M. and Conery, J.S. (2003) The origins of genome complexity. Science 302, 1401–1404
4 Alain, K. et al. (2002) Caminicella sporogenes gen. nov., sp. nov., a novel thermophilic spore-forming bacterium isolated from an East-Pacific Rise hydrothermal vent. Int. J. Syst. Evol. Microbiol. 52, 1621–1628
5 Novikova, N. et al. (2006) Survey of environmental biocontamination on board the International Space Station. Res. Microbiol. 157, 5–12
6 Kobayashi, K. et al. (2003) Essential Bacillus subtilis genes. Proc. Natl. Acad. Sci. U. S. A. 100, 4678–4683
7 Ochman, H. (2002) Distinguishing the ORFs from the ELFs: short bacterial genes and the annotation of genomes. Trends Genet. 18, 335–337
8 Vogel, J. and Sharma, C.M. (2005) How to find small non-coding RNAs in bacteria. Biol. Chem. 386, 1219–1238
9 Lobry, J.R. and Sueoka, N. (2002) Asymmetric directional mutation pressures in bacteria. Genome Biol. 3, RESEARCH0058
10 Rocha, E.P. (2004) Order and disorder in bacterial genomes. Curr. Opin. Microbiol. 7, 519–527
11 Rocha, E.P. and Danchin, A. (2003) Gene essentiality determines chromosome organisation in bacteria. Nucleic Acids Res. 31, 6570–6577
12 Daubin, V. and Perrière, G. (2003) G + C3 structuring along the genome: a common feature in prokaryotes. Mol. Biol. Evol. 20, 471–483
13 Couturier, E. and Rocha, E.P. (2006) Replication-associated gene dosage effects shape the genomes of fast-growing bacteria but only for transcription and translation genes. Mol. Microbiol. 59, 1506–1518
14 Koonin, E.V. (2003) Comparative genomics, minimal gene-sets and the last universal common ancestor. Nat. Rev. Microbiol. 1, 127–136
15 Lerat, E. et al. (2003) From gene trees to organismal phylogeny in prokaryotes: the case of the γ-proteobacteria. PLoS Biol. 1, E19
16 Welch, R.A. et al. (2002) Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc. Natl. Acad. Sci. U. S. A. 99, 17020–17024
17 Chen, S.L. et al. (2006) Identification of genes subject to positive selection in uropathogenic strains of Escherichia coli: a comparative genomics approach. Proc. Natl. Acad. Sci. U. S. A. 103, 5977–5982
18 Tettelin, H. et al. (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc. Natl. Acad. Sci. U. S. A. 102, 13950–13955
19 Legault, B.A. et al. (2006) Environmental genomics of “Haloquadratum walsbyi” in a saltern crystallizer indicates a large pool of accessory genes in an otherwise coherent species. BMC Genomics 7, 171
20 Gill, S.R. et al. (2006) Metagenomic analysis of the human distal gut microbiome. Science 312, 1355–1359
21 Venter, J.C. et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304, 66–74
22 Edwards, R.A. et al. (2006) Using pyrosequencing to shed light on deep mine microbial ecology. BMC Genomics 7, 57
23 Lerat, E. and Ochman, H. (2005) Recognizing the pseudogenes in bacterial genomes. Nucleic Acids Res. 33, 3125–3132
24 Lerat, E. and Ochman, H. (2004) Psi-Phi: exploring the outer limits of bacterial pseudogenes. Genome Res. 14, 2273–2278
25 Deng, W. et al. (2002) Genome sequence of Yersinia pestis KIM. J. Bacteriol. 184, 4601–4611
26 Parkhill, J. et al. (2001) Genome sequence of Yersinia pestis, the causative agent of plague. Nature 413, 523–527
27 Jin, Q. et al. (2002) Genome sequence of Shigella flexneri 2a: insights into pathogenicity through comparison with genomes of Escherichia coli K12 and O157. Nucleic Acids Res. 30, 4432–4441
28 Wei, J. et al. (2003) Complete genome sequence and comparative genomics of Shigella flexneri serotype 2a strain 2457T. Infect. Immun. 71, 2775–2786
29 Ochman, H. et al. (2000) Lateral gene transfer and the nature of bacterial innovation. Nature 405, 299–304
30 Lerat, E. et al. (2005) Evolutionary origins of genomic repertoires in bacteria. PLoS Biol. 3, 130
31 Pal, C. et al. (2005) Adaptive evolution of bacterial metabolic networks by horizontal gene transfer. Nat. Genet. 37, 1372–1375
32 Goldman, B.S. et al. (2006) Evolution of sensory complexity recorded in a myxobacterial genome. Proc. Natl. Acad. Sci. U. S. A. 103, 15200–15205
33 Chain, P.S. et al. (2006) Burkholderia xenovorans LB400 harbors a multi-replicon, 9.73-Mbp genome shaped for versatility. Proc. Natl. Acad. Sci. U. S. A. 103, 15280–15287
34 Médigue, C. et al. (1991) Evidence for horizontal gene transfer in Escherichia coli speciation. J. Mol. Biol. 222, 851–856
35 Nakamura, Y. et al. (2004) Biased biological functions of horizontally transferred genes in prokaryotic genomes. Nat. Genet. 36, 760–766
36 Ragan, M.A. (2001) On surrogate methods for detecting lateral gene transfer. FEMS Microbiol.
Lett. 201, 187–191 37 van Passel, M. et al. (2004) Phylogenetic validation of horizontal gene transfer? Nat. Genet. 36, 1028 38 Tettelin, H. and Parkhill, J. (2004) The use of genome annotation data and its impact on biological conclusions. Nat. Genet. 36, 1028– 1029 39 Daubin, V. et al. (2003) The source of laterally transferred genes in bacterial genomes. Genome Biol. 4, R57 40 Daubin, V. et al. (2003) Phylogenetics and the cohesion of bacterial genomes. Science 301, 829–832 41 Daubin, V. and Ochman, H. (2004) Bacterial genomes as new gene homes: the genealogy of ORFans in E. coli. Genome Res. 14, 1036– 1042 42 Daubin, V. and Ochman, H. (2004) Start-up entities in the origin of new genes. Curr. Opin. Genet. Dev. 14, 616–619 43 Yin, Y. and Fischer, D. (2006) On the origin of microbial ORFans: quantifying the strength of the evidence for viral lateral transfer. BMC Evol. Biol. 6, 63 44 Gogarten, J.P. et al. (2002) Prokaryotic evolution in light of gene transfer. Mol. Biol. Evol. 19, 2226–2238 45 Jain, R. et al. (1999) Horizontal gene transfer among genomes: the complexity hypothesis. Proc. Natl. Acad. Sci. U. S. A. 96, 3801–3806 46 Pal, C. et al. (2005) Horizontal gene transfer depends on gene content of the host. Bioinformatics 21 (Suppl 2), ii222–ii223 47 Teichmann, S.A. and Madan Babu, M. (2004) Gene regulatory network growth by duplication. Nat. Genet. 36, 492–496 48 Shen-Orr, S.S. et al. (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat. Genet. 31, 64–68 49 Salgado, H. et al. (2004) RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res. 32, D303–D306 50 Hershberg, R. and Margalit, H. (2006) Co-evolution of transcription factors and their targets depends on mode of regulation. Genome Biol. 7, R62 51 Madan Babu, M. et al. (2006) Evolutionary dynamics of prokaryotic transcriptional regulatory networks. J. Mol. Biol. 
358, 614–633 52 Lozada-Chavez, I. et al. (2006) Bacterial regulatory networks are extremely flexible in evolution. Nucleic Acids Res. 34, 3434–3445 53 Eisen, J.A. (1998) Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 8, 163– 167 54 Lazareva-Ulitsky, B. et al. (2005) On the quality of tree-based protein classification. Bioinformatics 21, 1876–1890 55 Delsuc, F. et al. (2005) Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6, 361–375 56 Ciccarelli, F.D. et al. (2006) Toward an automatic reconstruction of a highly resolved tree of life. Science 311, 1283–1287 57 Ge, F. et al. (2005) The cobweb of life revealed by genome-scale estimates of horizontal gene transfer. PLoS Biol. 3, e316 58 Beiko, R.G. et al. (2005) Highways of gene sharing in prokaryotes. Proc. Natl. Acad. Sci. U. S. A. 102, 14332–14337 59 Ochman, H. et al. (2005) Examining bacterial species under the specter of gene transfer and exchange. Proc. Natl. Acad. Sci. U. S. A. 102, 6595– 6599 60 Brown, J.R. et al. (2001) Universal trees based on large combined protein sequence data sets. Nat. Genet. 28, 281–285 61 Degnan, J.H. and Rosenberg, N.A. (2006) Discordance of species trees with their most likely gene trees. PLoS Genet. 2, e68