* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Separating derived from ancestral features of mouse and human
Quantitative trait locus wikipedia , lookup
Adaptive evolution in the human genome wikipedia , lookup
Essential gene wikipedia , lookup
Long non-coding RNA wikipedia , lookup
Epigenetics of neurodegenerative diseases wikipedia , lookup
Gene therapy wikipedia , lookup
Genetic engineering wikipedia , lookup
Gene nomenclature wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Whole genome sequencing wikipedia , lookup
Oncogenomics wikipedia , lookup
Human genetic variation wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Transposable element wikipedia , lookup
Gene expression programming wikipedia , lookup
Gene desert wikipedia , lookup
Non-coding DNA wikipedia , lookup
Copy-number variation wikipedia , lookup
Genomic library wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Metagenomics wikipedia , lookup
Biology and consumer behaviour wikipedia , lookup
Ridge (biology) wikipedia , lookup
Segmental Duplication on the Human Y Chromosome wikipedia , lookup
Genomic imprinting wikipedia , lookup
Public health genomics wikipedia , lookup
History of genetic engineering wikipedia , lookup
Gene expression profiling wikipedia , lookup
Helitron (biology) wikipedia , lookup
Human genome wikipedia , lookup
Human Genome Project wikipedia , lookup
Pathogenomics wikipedia , lookup
Genome (book) wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Genome editing wikipedia , lookup
Microevolution wikipedia , lookup
Minimal genome wikipedia , lookup
Designer baby wikipedia , lookup
734 Biochemical Society Transactions (2009) Volume 37, part 4 Separating derived from ancestral features of mouse and human genomes Chris P. Ponting1 and Leo Goodstadt MRC Functional Genomics Unit, University of Oxford, Department of Physiology, Anatomy and Genetics, South Parks Road, Oxford OX1 3QX, U.K. Abstract To take full advantage of the mouse as a model organism, it is essential to distinguish lineage-specific biology from what is shared between human and mouse. Investigations into shared genetic elements common to both have been well served by the draft human and mouse genome sequences. More recently, the virtually complete euchromatic sequences of the two reference genomes have been finished. These reveal a high (∼5%) level of sequence duplications that had previously been recalcitrant to sequencing and assembly. Within these duplications lie large numbers of rodent- or primate-specific genes. In the present paper, we review the sequence properties of the two genomes, dwelling most on the duplications, deletions and insertions that separate each of them from their most recent common ancestor, approx. 90 million years ago. We consider the differences in gene numbers and repertoires between the two species, and speculate on their contributions to lineage-specific biology. Loss of ancient single-copy genes are rare, as are gains of new functional genes through retrotransposition. Instead, most changes to the gene repertoire have occurred in large multicopy families. It has been proposed that numbers of such ‘environmental genes’ rise and fall, and their sequences change, as adaptive responses to infection and other environmental pressures, including conspecific competition. Nevertheless, many such genes may be under little or no selection. Introduction During the Cretaceous Period, approx. 90 million years ago, a line of small mammals, probably similar in appearance to modern tree shrews, split into two lineages. One of these led to the primates, including Homo sapiens, the other to rodents such as the laboratory mouse, Mus musculus. Despite the obvious morphological differences between humans and mice, close observation over 100 years of mouse genetics has demonstrated the many close anatomical and physiological affinities between humans and mice [1,2]. Yet how are these species similar or different in their genes and in genomes? How much has changed in these genomes in the last 90 million years, and how much has resisted change to preserve essential ancestral functions? If Darwinian adaptation has been responsible for the divergence between rodents and primates, would this have left a corresponding signature in their genomes? To answer these and other questions the sequencing of human and mouse genomes was required, followed by detailed comparisons pinpointing those nucleotides or exons or genes or chromosomal segments that have remained intact in both lineages since their last common ancestor. The draft genome sequences of humans in 2001 [3] and mice in 2002 [4] were great boons for research in genomics, genetics and evolution. Yet, these draft sequences were never intended to be comprehensive. It was always realized that Key words: comparative genomics, copy number variant, functional innovation, gene duplication, lineage-specific biology, mammalian evolution. Abbreviation used: ORF, open reading frame. 1 To whom correspondence should be addressed (email [email protected]). C The C 2009 Biochemical Society Authors Journal compilation sequencing of the extremely repetitive and transposon-rich heterochromatin would be too technically challenging. However, because heterochromatin is believed to have low gene density and is largely silenced, its absence has been met with little complaint. On the other hand, it was of great concern that a large amount of euchromatic sequence (>10% of human and >6.5% of mouse) was also found to be absent from the initial drafts. The process by which this additional sequence has been painstakingly added to genome assemblies is known as ‘finishing’, although even ‘finished’ euchromatin is not perfect and will contain a low level of missing and inaccurate sequence for some time to come [5]. The ‘finished’ human genome sequence was reported in 2004 [5] and a similar sequence for mice will be reported elsewhere soon [6]. The purpose of the present review is to provide a brief overview of the similarities and differences between these two genome sequences in the light of newfound sequence previously absent from draft assemblies. Mind the gaps The human genome assembly extends to 3.09 Gb (Table 1), containing 99% of the euchromatin and 94% of the entire genome [5]. This is a slightly larger genome assembly than for the mouse, which comprises approx. 2.66 Gb. Similarities between chromosome numbers and genome sizes belie the substantial rearrangement of the ancestral genome, particularly in the rodent lineage, with over 300 large (>300 kb) blocks of genes being reordered, and many more smaller scale rearrangements observed [4]. Across the whole Biochem. Soc. Trans. (2009) 37, 734–739; doi:10.1042/BST0370734 Protein Evolution: Sequences, Structures and Systems Table 1 Properties of finished human and mouse reference Figure 1 Mouse and human gene duplication as a function of age genome assemblies NCBI Builds 36.1 and 36 respectively. These are extant genes from the two reference genome assemblies. Their age of duplication has been estimated phylogenetically from the Property Human Mouse Assembled genome size (Gb) 3.091 2.661 Segmentally duplicated sequence (Mb) Interspersed repeats (Gb) 159.2 (5.52%) 126.0 (4.94%) 1.406 1.091 Number of gaps Sequence in gaps (Mb) Number of gene models 118 9.343 19042 1218 6.088 20210 Coding sequence (%) 1.07 1.27 of the human genome, only 40% of sequence can be aligned to the mouse genome, with much of the remainder representing remnants of transposable elements that have been frequently inserted and deleted in each lineage, much of which has decayed beyond recognition. Finishing the genome assemblies revealed that the draft assemblies were particularly deficient in segmental duplications, defined as >1 kb fragments of genomic sequence with high sequence identity (>90%) that map to multiple locations [7]. The repetitive nature of this sequence explains its recalcitrance to assembly, especially via the whole genome shotgun approach. Segmental duplications are now known to cover approx. 5% of both human and mouse euchromatin [8]. These newly discovered duplicated sequences would perhaps be only of passing interest, except that they contain a high density of duplicated protein-coding genes (see below). These same regions also tend to be highly variable in structure, including copy number, among unrelated human or mouse individuals [9–11]. The mouse sequence is from a single highly inbred individual female from the laboratory black 6 (C57BL/6) strain. However, the human reference genome represents the agglomeration of contributions from an anonymous panel of outbred individuals [3], although more than half the assembly stems from a single contributor. Each individual carries ∼0.5% copy number variant sequence, often in a heterozygous state [12]. The human genome sequence is thus not a consensus representing the most frequent variants. Instead, because of selection for size in choosing insert clones for sequencing, parts of the human reference genome may contain a systematic bias towards the incorporation of variable regions, especially those with high copy numbers. In any case, even for the mouse where copy number variants are seen in abundance between different inbred mouse strains [8], structural variation between individuals highlights the limitations of a single reference genome in representing an entire population. Gene repertoires Interest in relating gene numbers to organismal complexity has waned considerably since the discovery that the genome ratio of the duplicates’ branch length to the distance to the rodent primate speciation node, and by assuming that the split occurred 90 million years ago. of the simple nematode Caenorhabditis elegans contains almost as many protein-coding genes as humans [13–15]. Nevertheless, the reduction of the human gene count from an initial 32 000 in the draft human genome publication [3] to its current level of approx. 19 000 shows the many inherent difficulties in gene predictions as well as the great progress that has been made since. Initial gene counts were greatly inflated by the inclusion of fragmentary gene models, many of which represent the debris of non-functional pseudogenes. Spurious ORFs (open reading frames) present by chance in RNA transcripts were also misidentified even in the absence of either protein-coding potential or evolutionary conservation [15]. Disruptions to putative ORFs such as small insertions, deletions and frameshifts were difficult to distinguish from sequence errors that are often found among even bona fide gene predictions in a draft genome sequence. Finally, it has never been straightforward to distinguish one gene from its genomic neighbour, particularly as (non-coding or chimaeric) transcripts can often be found to span the two [16,17]. The current low estimates of gene number in the human genome [13–15] result from the exclusion of erroneous gene models that fail criteria for protein-coding potential and conservation. These rely upon three widely held observations: that intron positions and especially phase are very well conserved within coding sequence, that confirmed cases of mammalian protein-coding genes recruited de novo from non-coding sequence are very rare, and that retrotransposition generally results in non-functional sequence with immediate loss of constraint. This last assumption appears to be the most fallible, since promoter elements are sometimes fortuitously present at the 5 -end of the integration site of retrotransposons. However, empirically, because only about one functional retrogene arises per million years, this results C The C 2009 Biochemical Society Authors Journal compilation 735 736 Biochemical Society Transactions (2009) Volume 37, part 4 in only a small underestimation of ∼100 genes since the human mouse divergence [18,19]. The great majority of genes are thus gained in each lineage not by de novo creation or retrotransposition, but instead by whole gene duplication, often in tandem copies; simultaneously, functional genes are lost by deletion or after disruptive mutations (‘pseudogenization’). The fates of genes on each of the human and mouse lineages are thus best described by gene births and deaths in the time since their last common ancestor approx. 90 million years ago. Figure 1 shows the rate of accumulation of extant genes in both Mus musculus and Homo sapiens lineages. The large numbers of the most recently duplicated genes largely reflect copy number variable genes: many of these are unlikely to be fixed in future populations (see below). Correspondence between mouse and human genes To a Well-Connected Mouse (Upon reading of the genetic closeness of mice and men.) Wee, sleekit, cow’rin, tim’rous beastie, Braw science says that at the leastie We share full ninety-nine per cent O’genes, where’er the odd ane went. c 2003 John Updike. Originally published in The New Yorker. All rights reserved. This reworking of the 18th Century Robert Burns poem (“To a Mouse”) was sparked by the Nature editorial commentary that 99% of mouse genes have “direct counterparts in humans” [20]. This statement unfortunately gives the misleading impression that human and mouse genes mostly correspond, and that divergence in their genomes is thus largely either within the sequence of each corresponding gene, or in regulatory regions. In fact, the mouse and human gene repertoires have been dramatically remodelled since the divergence of the lineages ∼90 million years ago: 20% (3852 of 19042) of human genes and 24% (5020 of 20210) of mouse genes are duplicate copies which have arisen since their last common ancestor (Figure 2). As for “where’er the odd ane went”, the Nature commentary reflects that in fewer than 1% of genes has every homologue become extinct in the other lineage. These are genes that had long persisted before the common ancestor and yet have latterly become dispensable. This may no doubt reflect profound changes to the functional repertoire. An example of a loss of an erstwhile essential gene is that of EYS whose disruption in humans results in a form of adolescent onset blindness (retinitis pigmentosa), but whose loss approx. 80 million years ago seems to have little affected the rodents [21]. A more recent example in evolution involves the human-specific loss of the myosin heavy chain MYH16 which has been argued in a ‘less is more’ hypothesis to confer a selective benefit in increasing the cranial capacity [22]. Approx. 15 187 genes have remained unduplicated in both human and mouse lineages since their last common ancestor C The C 2009 Biochemical Society Authors Journal compilation Figure 2 Mouse genes have a higher synonymous nucleotide substitution rate (dS ) and have accumulated more lineage-specific duplicates than human genes (A) Mouse and human phylogeny drawn to the dS scale. (B) The number of 1:1 mouse and human orthologues (black) and the number of gene duplicates unique to each species (grey). Table 2 Properties of human and mouse simple 1:1 orthologues Properties are median values. d N , non-synonymous substituion; d S , synonymous substitution. Property Value Counts of 1:1 orthologues 15 187 dN dS d N /d S ratio 0.057 0.58 0.095 Amino acid sequence identity (%) Coding sequence identity (%) Aligned sequence length (codons) 88.2 85.3 434 (Table 2). These ‘simple orthologues’ have also substantially conserved their sequences: their median amino acid and nucleotide identities are 88 and 85% respectively. These are the genes expected to convey much of the functional repertoire that is conserved among mammals and, more broadly, among other animals. It is thus appropriate that these simple orthologues are being specifically targeted for disruption in large-scale phenotypic screens in the mouse to illuminate conserved mammalian biology. The estimated gene count is substantially higher in mice (20 210) than in humans (19 042) because of the larger number of rodent-specific gene duplicates. This is almost entirely due to genes with roles in chemosensation, such as olfactory and vomeronasal receptors and pheromone genes. The cull of primate-specific chemosensation genes relative to murid rodents and other mammals seems to have accelerated in the oldworld primates in the last ∼25 million years [23,24]. The contrast in gene repertoires perhaps reflects the divergent sensory requirements for mammals that differ in diet and behaviour. Gene duplications in mice are predominantly found in tandem copies of genomic sequence [8], and most probably arise via unequal crossover or non-allelic homologous recombination. In contrast, human duplicates are more often dispersed to different chromosomes in recombinations that were mediated by the primate-specific Alu-SINE (short interspersed element) retrotransposons [25,26]. Protein Evolution: Sequences, Structures and Systems Large genomic duplications are less frequent than smaller ones. Thus compact genes are more likely to be duplicated with a complete and intact ORF, as well as the accompanying promoter and other necessary non-coding regulatory sequence. Median sizes of human- or mouse-lineage-specific genes are thus approx. 3- and 6-fold smaller than those of simple orthologues (6.6 or 3.3 kb compared with 26.6 or 21.2 kb respectively); they have fewer exons (medians of five and three compared with nine for both mice and humans); and they are 5- and 12-fold (in humans or mice) more likely to contain only a single coding exon. Unequal crossing-over can also generate novel functional genes, either via the formation of chimaeric genes or as partial duplicates (e.g. [27]). For example, a duplication of a fragment of the rodent Dlg5 gene spawned an extended family of testis-specific genes that have become extremely widespread in the mouse and rat genomes [28]. Partial duplications have also occasionally been observed within duplicated genes. The primate-specific LPA gene, for example, was formed initially by a complete duplication of PLG with a subsequent succession of further internal duplications of individual domains [29]. Innovative functions of lineage-specific genes Duplicated genes form a far from random selection of all genes. Aside from the bias in gene sizes mentioned above, several functional categories are overrepresented among genes specific to either primate or rodent lineages. These include (i) chemosensation genes which are more numerous in rodents, and genes associated with (ii) immunity or host defence (e.g. T-cell receptor genes), (iii) with detoxification (e.g. cytochrome P450), and (iv) with reproduction (e.g. pheromone or cancer-testis antigen genes [30]). Many of these genes are of considerable biological interest and are found in regions of the genome rich in segmental duplications and interspersed repeats, and are thus only present in finished genome assemblies. The extreme repetitive nature of the Y chromosome has required particularly focused efforts to complete its sequence. Indeed, further copies of rapidly evolving gene families can be expected to be discovered as the remaining gaps in the human and mouse genomes are closed. These functional biases might imply that gene duplication events are largely adaptive: expansions in the gene repertoire may be a response to environmental challenges from pathogens and parasites, or confer reproductive advantages over other individuals from the same species. Evidence of positive selection at specific codons [dN /dS ratio (ratio of non-synonymous to synonymous substitutions) significantly greater than 1] has often been cited for duplicated gene families (e.g. [31,32]). However, with the small mammalian (especially primate) effective population sizes, chance gene duplications may remain in the population for some time or proceed to fixation when they are not advantageous or even mildly deleterious. This would explain the recent provenance and hence short lifespans of gene duplicates (Figure 1), some of which are unfixed and copy-number-variable within human or mouse populations. The observed overrepresentation of functional categories may simply be because duplications of genes in other classes tend to be deleterious (in part due to stoichiometric constraints) and thus are preferentially purged from the population [33,34]. Even where evidence for positive selection is cited, this may instead be the result of loss of constraint in redundant gene copies after duplication. Other cases of apparent adaptive change can be due to biased gene conversion [35]. This is a bias in the recombination-associated repair process following double-strand breaks and can lead to the erroneous inference of positive selection [36,37]. It is suggested that biased gene conversion among adjacent tandemly duplicated genes may drive their accelerated evolution [38]. On the other hand, the short lifespan of many gene duplicates may point to the short period of advantage that novelty confers in a constantly changing adaptive landscape. A high turnover in the gene repertoire with many births and deaths would be seen in an evolutionary ‘arms race’ with pathogens [39]. There can also be positive selection among genes expressed in germline or stem cells [40–42] which confers no advantage on either the individual or population. Instead, gain-of-function mutations that, for example, favour clonal expansion of mutant spermatogonia [43], may underlie much strong diversifying selection. Almost 40 years after Ohno [44] first proposed that gene duplications are the principal forces for the generation of novelty in evolution, the relative contributions of adaptation and genetic drift in maintaining our gene repertoires thus remain to be determined. The availability of further primate and rodent assembled genomes of high quality on one hand, and extensive allele frequency data for copy number variants in natural primate and rodent populations on the other, may be required to decide between these two evolutionary scenarios [45]. Concluding remarks Strenuous efforts have now provided near-complete euchromatic genome sequence assemblies of humans and mice. These have revealed large numbers of lineage-specific genes lying within segmentally duplicated genomic regions that were previously recalcitrant to sequencing and assembly. Many of these novel genes are studied only rarely, partly because of their relatively late arrival into genome sequence assemblies, and partly due precisely to the absence of single orthologous genes in the other species: inferences from model organisms about lineage-specific human genes have to be treated with the greatest caution, while a case has to be made for the relevance of each mouse-specific gene to human physiology and disease. Nevertheless, if we are to fully appreciate the functional repertoire of our own species, and C The C 2009 Biochemical Society Authors Journal compilation 737 738 Biochemical Society Transactions (2009) Volume 37, part 4 that of the mouse, our most important model organism, then these genes demand much further scrutiny in the future. Funding We gratefully acknowledge support from the Medical Research Council. References 1 Paigen, K. (2003) One hundred years of mouse genetics: an intellectual history. I. The classical period (1902–1980). Genetics 163, 1–7 2 Paigen, K. (2003) One hundred years of mouse genetics: an intellectual history. II. The molecular revolution (1981–2002). Genetics 163, 1227–1235 3 Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W. et al. (2001) Initial sequencing and analysis of the human genome. Nature 409, 860–921 4 Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P. et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 5 International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 6 Church, D.M., Goodstadt, L., Hillier, L.W., Zody, M.C., Goldstein, S., She, X., Bult, C.J., Agarwala, R., Cherry, J.L., DiCuccio, M. et al. (2009) Lineage-specific biology revealed by a finished genome assembly of the mouse. PLoS Biol. 7, e1000112 7 Eichler, E.E., Clark, R.A. and She, X. (2004) An assessment of the sequence gaps: unfinished business in a finished human genome. Nat. Rev. Genet. 5, 345–354 8 She, X., Cheng, Z., Zollner, S., Church, D.M. and Eichler, E.E. (2008) Mouse segmental duplication and copy number variation. Nat. Genet. 40, 909–914 9 Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Maner, S., Massa, H., Walker, M., Chi, M. et al. (2004) Large-scale copy number polymorphism in the human genome. Science 305, 525–528 10 Iafrate, A.J., Feuk, L., Rivera, M.N., Listewnik, M.L., Donahoe, P.K., Qi, Y., Scherer, S.W. and Lee, C. (2004) Detection of large-scale variation in the human genome. Nat. Genet. 36, 949–951 11 Tuzun, E., Sharp, A.J., Bailey, J.A., Kaul, R., Morrison, V.A., Pertz, L.M., Haugen, E., Hayden, H., Albertson, D., Pinkel, D. et al. (2005) Fine-scale structural variation of the human genome. Nat. Genet. 37, 727–732 12 McCarroll, S.A., Kuruvilla, F.G., Korn, J.M., Cawley, S., Nemesh, J., Wysoker, A., Shapero, M.H., de Bakker, P.I., Maller, J.B., Kirby, A. et al. (2008) Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat. Genet. 40, 1166–1174 13 Goodstadt, L. and Ponting, C.P. (2006) Phylogenetic reconstruction of orthology, paralogy, and conserved synteny for dog and human. PLoS Comput. Biol. 2, e133 14 Goodstadt, L., Heger, A., Webber, C. and Ponting, C.P. (2007) An analysis of the gene complement of a marsupial, Monodelphis domestica: evolution of lineage-specific genes and giant chromosomes. Genome Res. 17, 969–981 15 Clamp, M., Fry, B., Kamal, M., Xie, X., Cuff, J., Lin, M.F., Kellis, M., Lindblad-Toh, K. and Lander, E.S. (2007) Distinguishing protein-coding and noncoding genes in the human genome. Proc. Natl. Acad. Sci. U.S.A. 104, 19428–19433 16 Gerstein, M.B., Bruce, C., Rozowsky, J.S., Zheng, D., Du, J., Korbel, J.O., Emanuelsson, O., Zhang, Z.D., Weissman, S. and Snyder, M. (2007) What is a gene, post-ENCODE? History and updated definition. Genome Res. 17, 669–681 17 Denoeud, F., Kapranov, P., Ucla, C., Frankish, A., Castelo, R., Drenkow, J., Lagarde, J., Alioto, T., Manzano, C., Chrast, J. et al. (2007) Prominent use of distal 5 transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 17, 746–759 C The C 2009 Biochemical Society Authors Journal compilation 18 Marques, A.C., Dupanloup, I., Vinckenbosch, N., Reymond, A. and Kaessmann, H. (2005) Emergence of young human genes after a burst of retroposition in primates. PLoS Biol. 3, e357 19 Kaessmann, H., Vinckenbosch, N. and Long, M. (2009) RNA-based gene duplication: mechanistic and evolutionary insights. Nat. Rev. Genet. 10, 19–31 20 Gunter, C. and Dhand, R. (2002) Human biology by proxy. Nature 420, 509 21 Abd El-Aziz, M.M., Barragan, I., O’Driscoll, C.A., Goodstadt, L., Prigmore, E., Borrego, S., Mena, M., Pieras, J.I., El-Ashry, M.F., Safieh, L.A. et al. (2008) EYS, encoding an ortholog of Drosophila spacemaker, is mutated in autosomal recessive retinitis pigmentosa. Nat. Genet. 40, 1285–1287 22 Stedman, H.H., Kozyak, B.W., Nelson, A., Thesier, D.M., Su, L.T., Low, D.W., Bridges, C.R., Shrager, J.B., Minugh-Purvis, N. and Mitchell, M.A. (2004) Myosin gene mutation correlates with anatomical changes in the human lineage. Nature 428, 415–418 23 Shi, P. and Zhang, J. (2007) Comparative genomic analysis identifies an evolutionary shift of vomeronasal receptor gene repertoires in the vertebrate transition from water to land. Genome Res. 17, 166–174 24 Liman, E.R. (2006) Use it or lose it: molecular evolution of sensory signaling in primates. Pflugers Arch. 453, 125–131 25 Bailey, J.A., Liu, G. and Eichler, E.E. (2003) An Alu transposition model for the origin and expansion of human segmental duplications. Am. J. Hum. Genet. 73, 823–834 26 Kim, P.M., Lam, H.Y., Urban, A.E., Korbel, J.O., Affourtit, J., Grubert, F., Chen, X., Weissman, S., Snyder, M. and Gerstein, M.B. (2008) Analysis of copy number variants and segmental duplications in the human genome: evidence for a change in the process of formation in recent evolutionary history. Genome Res. 18, 1865–1874 27 Wang, X. and Zhang, J. (2006) Remarkable expansions of an X-linked reproductive homeobox gene cluster in rodent evolution. Genomics 88, 34–43 28 Spiess, A.N., Walther, N., Muller, N., Balvers, M., Hansis, C. and Ivell, R. (2003) SPEER: a new family of testis-specific genes from the mouse. Biol. Reprod. 68, 2044–2054 29 Lawn, R.M., Schwartz, K. and Patthy, L. (1997) Convergent evolution of apolipoprotein(a) in primates and hedgehog. Proc. Natl. Acad. Sci. U.S.A. 94, 11992–11997 30 Emes, R.D., Goodstadt, L., Winter, E.E. and Ponting, C.P. (2003) Comparison of the genomes of human and mouse lays the foundation of genome zoology. Hum. Mol. Genet. 12, 701–709 31 Karn, R.C., Clark, N.L., Nguyen, E.D. and Swanson, W.J. (2008) Adaptive evolution in rodent seminal vesicle secretion proteins. Mol. Biol. Evol. 25, 2301–2310 32 Jackson, M., Watt, A.J., Gautier, P., Gilchrist, D., Driehaus, J., Graham, G.J., Keebler, J., Prugnolle, F., Awadalla, P. and Forrester, L.M. (2006) A murine specific expansion of the Rhox cluster involved in embryonic stem cell biology is under natural selection. BMC Genomics 7, 212 33 Dopman, E.B. and Hartl, D.L. (2007) A portrait of copy-number polymorphism in Drosophila melanogaster. Proc. Natl. Acad. Sci. U.S.A. 104, 19920–19925 34 Nguyen, D.Q., Webber, C. and Ponting, C.P. (2006) Bias of selection on human copy-number variants. PLoS Genet. 2, e20 35 Marais, G. (2003) Biased gene conversion: implications for genome and sex evolution. Trends Genet. 19, 330–338 36 Berglund, J., Pollard, K.S. and Webster, M.T. (2009) Hotspots of biased nucleotide substitutions in human genes. PLoS Biol. 7, e26 37 Galtier, N., Duret, L., Glemin, S. and Ranwez, V. (2009) GC-biased gene conversion promotes the fixation of deleterious amino acid changes in primates. Trends Genet. 25, 1–5 38 Hurst, L.D. (2009) Evolutionary genomics: a positive becomes a negative. Nature 457, 543–544 39 Dawkins, R. and Krebs, J.R. (1979) Arms races between and within species. Proc. R. Soc. London Ser. B 205, 489–511 40 Laukaitis, C.M., Heger, A., Blakley, T.D., Munclinger, P., Ponting, C.P. and Karn, R.C. (2008) Rapid bursts of androgen-binding protein (Abp) gene duplication occurred independently in diverse mammals. BMC Evol. Biol. 8, 46 41 Birtle, Z., Goodstadt, L. and Ponting, C. (2005) Duplication and positive selection among hominin-specific PRAME genes. BMC Genomics 6, 120 Protein Evolution: Sequences, Structures and Systems 42 Johnson, M.E., Viggiano, L., Bailey, J.A., Abdul-Rauf, M., Goodwin, G., Rocchi, M. and Eichler, E.E. (2001) Positive selection of a gene family during the emergence of humans and African apes. Nature 413, 514–519 43 Goriely, A., McVean, G.A., Rojmyr, M., Ingemarsson, B. and Wilkie, A.O. (2003) Evidence for selective advantage of pathogenic FGFR2 mutations in the male germ line. Science 301, 643–646 44 Ohno, S. (1970) Evolution by gene duplication, Springer-Verlag, Heidelberg 45 Perry, G.H., Yang, F., Marques-Bonet, T., Murphy, C., Fitzgerald, T., Lee, A.S., Hyland, C., Stone, A.C., Hurles, M.E., Tyler-Smith, C. et al. (2008) Copy number variation and evolution in humans and chimpanzees. Genome Res. 18, 1698–1710 Received 10 February 2009 doi:10.1042/BST0370734 C The C 2009 Biochemical Society Authors Journal compilation 739