Review TRENDS in Ecology and Evolution Vol.21 No.6 June 2006

Genomics of the evolutionary process

Andrew G. Clark
Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853, USA

Comparative analysis of genome sequences has become the primary means by which functional elements are first identified, often preceding even the identification of their function. Although this approach capitalizes on the conservation of homologous functions, it has also been successful in identifying evolutionary novelties, including new genes and pathways. As I discuss here, the analysis of multiple alignments of sequences from species on a known phylogeny has provided rich detail about the heterogeneities in the process of genome changes. Inferences of positive selection acting on protein-encoding genes have provided clues about the role of adaptive evolution in the past. These methods also identify negatively selected genes, providing some clue to genes that are most likely to be mutable to a disease-causing state.

Comparative genomics: discovery and hypothesis-driven science

One of the most wonderful things about comparative genomics is that it has turned a whole generation of molecular biologists into evolutionists, full of excitement about the way that evolution has sculpted exquisite modifications to organismal genomes and eager to tell stories about it. At the same time, one of its worst disasters is that it has created a horde of genomics investigators who think that evolutionary biology is just fun, speculative story telling. Sadly, much of the scientific publication industry seems to respond to the herd as much as it does to scientific rigor, and so we have a bit of a mess on our hands. Fortunately, this is all a temporary aberration and, eventually, the noise will be separated from the signal, and progress will march on in understanding what genome sequence divergence really means.
The genomics enterprise has fallen upon us so suddenly that it is hardly surprising that a thoughtful and contemplative field such as evolutionary biology is struggling to keep up. It is good for the field to have this upheaval, as it has brought in a wealth of new ideas and approaches. In the meantime, it has become clear that complete genome sequencing of multiple species is providing a deep and inspiring set of new problems and solutions for evolutionary biologists, including those working on species that they thought were phylogenetically remote from any model organism.

Corresponding author: Clark, A.G. ([email protected]). Available online 5 May 2006

Comparative evolutionary genomic analysis has proceeded along two parallel paths. The first takes the old hypothesis- and model-driven approach in population biology, where a problem is developed abstractly, even to the point of a mathematical model, and the data are then considered for their ability to assess the merits of the model. This approach has been empowered by genomic data, because their sheer volume makes the power of the tests high. However, the models are generally simplified cartoons of reality and, by the time one has whole-genome data, the many ways that the real data can depart from the clean, simple model become all too apparent. The second approach is more akin to the voyage of the HMS Beagle, setting sail to who knows where, amassing genome sequence data on our hard drives and pawing through it to discover things we have not seen before. Although it was easy to make fun of this ‘discovery science’ approach when it was first espoused, we now know that it has done exactly what was promised: all kinds of exciting, unexpected things happen to genomes as they diverge, and a thorough description of these observations has considerable merit.
In many ways, this descriptive, discovery aspect of genomic science is a recapitulation of the grand era of natural history during the 1800s, when the detailed cataloging of nature gave way to the hunger to understand why there is such diversity and structure to the natural world.

Descriptive genomics

Sound genomic analysis must always begin with a thorough description. We do not yet know the function of all the elements of any genome, but at least we can attempt to provide a useful description of the data at hand. With just the slightest scratching beneath the surface, these descriptive studies reveal a great deal to stimulate thinking and questioning about genome evolution. In many ways, the natural historical description of a genome is more pregnant with implied evolutionary process than is a description of the natural history of an organism. This is because each genome carries clear echoes of its past.

Consider the inference of evolutionary process that comes from the analysis of genomic duplications reported by Evan Eichler’s group [1]. By simple computational algorithms, these researchers have found that nearly one-third of the duplications in the human genome are not present in chimpanzee and that, overall, the process of duplication has altered more nucleotides than have single-nucleotide substitutions since our common ancestry with chimpanzees. Despite all this fluidity that has been introduced by gene duplication and rearrangement, there is variation in the degree of maintenance of synteny (chromosomal co-occurrence of groups of genes) across disparate species. Why should this be so? Is it simply chance, or is there a level of genomic organization that has functional consequences? This second question is a challenge because it demands comparison across multiple disparate species and their genomes.

www.sciencedirect.com 0169-5347/$ - see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.tree.2006.04.004

There has been some recent progress on making evolutionary inferences about the determinants of genome size by the analysis of transposons [2], or through the analysis of gene loss in parasitic plants [3] and endosymbiotic microbes [4]. It seems to be a common theme in evolution that when genes are no longer needed, especially after adopting a parasitic life style, then they are simply lost. Parasitic plants, for instance, can dispense with the maintenance of costly genes involved in photosynthesis. Gene transfers, both horizontal (i.e. movement of segments of genomes between distantly related species) and otherwise, remain challenging partly because they break the rules of phylogenetic inference [5,6]. For genetic sequence divergence that remains concordant with a species phylogeny, sophisticated analysis remains possible, but once genes undergo horizontal transfer, their phylogeny departs from the species phylogeny, and the best that one can do is estimate the rate of transfer.

Many other attributes of genome structure still require evolutionary explanations. Why are genomes structured to have regions with such wildly fluctuating gene density (‘deserts and jungles’)? Is this adaptive in any sense, or does it arise as a result of local constraints on sequence evolution? What is the role of heterochromatin in genome evolution, and why is there such variation among organisms in how much heterochromatin their genomes bear? Centromeres have a central role in chromosome disjunction, but their structure gives no clue as to how they function or how their functions evolve, other than the likelihood that selfish elements drive them [7].
Although there is already a library full of papers on the evolution of transposable elements, it is clear that the analysis of multiple complete genome sequences will enable an unprecedented assessment of their intragenomic demography.

Horizontal transfers can occur among highly unpredictable pairs of species. There is even movement of genetic material from the mitochondrion out to the nuclear genome, and many higher organisms experience this at a sufficient rate that the complete mitochondrial genome occurs in pieces within the nuclear genome [8]. The initial genome sequences of the ascidian Ciona intestinalis and of the mosquito Anopheles gambiae were only partially successful owing to problems in assembling some regions of the genome that appeared to be maintaining two widely disparate haplotypes. Were these the result of ancient introgression events? What maintains these regions in the genome? Across different individuals, are the same regions maintaining this heterogeneity? How common is this kind of genome structure?

Finding functional elements

One can know little about the field of evolutionary biology and still make use of the idea that conserved elements of genome sequences are more likely to be functional than are regions that are substituted beyond recognition. This argument can be carried too far (Figure 1), and it is best to avoid the conclusion that just because something is conserved it must have a vital function. There might be another molecular process that gives a portion an abnormally low mutation rate, for example by virtue of palindromic repeats whose head-to-head redundancy enables gene conversion across the repeats. A widely cited conclusion drawn from the sequencing of the mouse genome is that, because of the pattern of conservation with human genome sequence, 5% of mammalian genomes are undergoing purifying natural selection, a form of natural selection that results in the conservation of sequence across species [9]. This estimate has not faced any serious challenge, but the initial calculation is based on a simple model that cannot accommodate the intricacies of variation in the process of mutation itself.

[Figure 1 image not reproduced. Panels: APOE region aligned against Gibbon, Baboon, Marmoset, Mouse, Dog, Cow, Chicken and Fugu. X axis: position on chr 19 (Mb), 50.100 to 50.106.]

Figure 1. Identifying conserved DNA sequence elements. Alignment of genome sequences and tools for visualizing those alignments have proven remarkably useful in identifying how evolution has resulted in the conservation of genetic elements and the erasure of nonessential sequence. The figure shows a multi-species alignment of a homologous region to chr 19 in humans spanning the gene encoding apolipoprotein E. Each band of the figure has a Y axis spanning from 50% to 100% sequence identity. The three exons, indicated in blue, show strong conservation in all mammals, whereas the gene is missing altogether from chicken and fugu. The phylogenetic tree shows only the rough topology of the phylogeny and not the temporal scale. For more information on related visualization tools, see http://genome.lbl.gov/vista/index.shtml (tools from this website were used to generate the figure).

Recently, a series of papers reported on exceptionally conserved non-protein-encoding elements in the genome [10–13]. These regions are polymorphic within species and have a spectrum of nucleotide frequencies that, in a manner consistent with purifying selection, departs from that expected under neutrality [14]. The existence of polymorphism rules out the hypothesis that the exceptionally slow divergence might arise from a simple lack of mutation. It seems inescapable that these regions must encode a conserved function. The conserved elements are probably involved in a variety of disparate functions.
One such function is encoded by noncoding RNAs or microRNAs, which have a key role in regulating gene expression, and whose double-stranded structure presents a signal that is amenable to genome-wide scanning [15]. Recently, it was shown that a conserved enhancer element is related to an ancient retroelement, owing to its redundancy and within-genome similarity (Gill Bejerano, pers. commun.). In addition to seeking regions with exceptionally low rates of DNA base substitution, attempts have been made to identify functional regions by their unusually low rates of insertion and deletion in the sequence [16]. This approach also finds regions that are remarkably devoid of change in the human versus mouse comparison, and a model fit suggests that ~3% of the genome appears to be undergoing purifying natural selection (assuming a reasonable distribution of insertion rate variation).

One of the most important first tasks for investigators studying genomic sequence is to identify the protein-encoding genes, a task that remains surprisingly challenging. The reason that such genes are so hard to find is that they are so diverse in their structures, the signatures of different components of genes can be subtle and genes can overlap in the most bizarre ways. The panoply of nested genes, overlapping genes, genes with shared exons and exons that are read in more than one reading frame is sufficiently bizarre to make one wonder whether accurate gene finding is something that a computer will ever be able to do. A development that will only continue to increase in power and utility is to make use of pairwise alignments and multi-alignments to find genes [17]. These methods use both the among-species conservation and the conservation of putative functional attributes of genes (start codon, splice signals and polyadenylation signals) to identify probable features of genes.
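The idea of scanning for regions with exceptionally low rates of base substitution can be illustrated with a sliding-window percent-identity profile over a pairwise alignment, similar in spirit to the identity bands in Figure 1. The following is a minimal sketch, not the method of any tool cited here: the function names, the window and step sizes and the 70% threshold are all illustrative assumptions, and gaps are simply counted as mismatches.

```python
def window_identity(aligned_a, aligned_b, window=10, step=5):
    """Percent identity in sliding windows over two aligned sequences.

    Both inputs must already be aligned (equal length, '-' for gaps).
    A column counts as a match only if both bases agree and are not gaps.
    Returns a list of (window_start, percent_identity) tuples.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for start in range(0, len(aligned_a) - window + 1, step):
        columns = zip(aligned_a[start:start + window],
                      aligned_b[start:start + window])
        matches = sum(1 for x, y in columns if x == y and x != "-")
        profile.append((start, 100.0 * matches / window))
    return profile


def conserved_windows(profile, threshold=70.0):
    """Start positions of windows at or above the (assumed) identity threshold."""
    return [start for start, pct in profile if pct >= threshold]


if __name__ == "__main__":
    # Toy alignment: a perfectly conserved block followed by a diverged block.
    seq_a = "ACGTACGTAC" + "GGGGGGGGGG"
    seq_b = "ACGTACGTAC" + "TTTTTTTTTT"
    profile = window_identity(seq_a, seq_b, window=10, step=10)
    print(profile)
    print(conserved_windows(profile))
```

As the text cautions, a high-identity window is only a candidate: conservation alone does not establish function, and a locally low mutation rate can produce the same pattern.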
Possession of a complete gene list provides a powerful tool for comparison between species, because the inference of the absence of a gene implies that the respective function is either lost or replaced by another gene. Such exhaustive lists of character states (gene presence or absence) are also remarkably informative for phylogeny reconstruction.

As well as finding genes, the identification of regulatory regions of genes is another problem that relies on evolutionary conservation. It seemed as though the entire genomics community was expecting the analysis of the mouse genome sequence to identify all the conserved regulatory elements between human and mouse, and to use this conservation to annotate the regulatory features of the human genome. Unfortunately, many of the regulatory elements are not conserved, and many of the small upstream gene regions that are conserved appear not to have a regulatory function [18]. By comparing sequences of many closely related primates, Boffelli et al. [19] found that regulatory regions have a deficit of nucleotide differences compared with flanking regions, and so they leave a shadow of their presence. This ‘phylogenetic shadowing’ works even better than was initially proposed, and extensions of this approach have been used successfully to reveal regulatory features of genomes.

The goal of finding regulatory features is so important that another major initiative of the National Human Genome Research Institute (http://www.nhlbi.nih.gov) was launched to identify exhaustively the functional features of 1% of the human genome, selected to span the range of gene density, recombination rate and several other attributes. This ‘ENCODE’ project (http://www.genome.gov/10005107) is proving to be a goldmine for the detailed analysis of a small portion of the genome.
The project also underscores the nature and depth of our ignorance of the remaining 99% of the genome by enabling us to compare the analysis of the low- and high-resolution data for these regions. This was especially well used in the Haplotype Map project (http://www.hapmap.org), where the inference of linkage disequilibrium with unobserved variation could be tested in the denser ENCODE regions of the genome [20].

Evolutionary biologists all too often find themselves in the maddening situation of having to rely on sequences of reasonably good quality, but whose annotation of functional attributes seems to be little more than a guess. One supposes that the reason so little funding goes into genome annotation compared with sequence generation is that annotation is somehow less glamorous, and that perhaps in the future automatic annotation will get better. In the meantime, a bigger limit to progress in evolutionary genomic analysis is poor annotation rather than lack of sequence.

Non-uniformity of the substitution process

One of the first models of molecular evolution supposed that the four bases of DNA undergo substitution at equal rates, μ. This Jukes-Cantor model [21] provided a quantitative relation between the substitution rate and the level of sequence divergence as a function of time. However, it was woefully inadequate because empirical data showed that transitions occur at a much greater rate than transversions, giving rise to a model with two parameters, one for each class of mutation. Now, there is a large diversity of such substitution models and a suite of tools that enable one to fit these models (by estimating the parameters that quantify the various rates) to sequence alignments. However, as one scans along a genome, these substitution rate estimates vary widely.
Even worse, the best-fitting model for the substitution process for one region of the genome has a different structure from the best-fitting model for other genomic regions. This is especially evident in mammals, whose genomes have great within-species heterogeneity in base composition, with regions of high GC content having a different substitution rate matrix compared with regions of low GC content. Remarkably, translocations of genetic material from a region of low GC to high GC content result in a shift in the substitution process so that the newly arrived segment tends, over time, to match the GC content of the surrounding sequence.

As proud as we are of these modeling efforts, we also know that they fail to capture all the details of how substitutions occur. For example, it is clear that even noncoding sequences have strong biases in dinucleotide content [22]. This implies that a better-fitting model would accommodate these neighboring base effects, and could not be represented by a simple matrix of transitions from one single base to the next. One particular class of neighboring-base effect is the direct empirical observation that CpG dinucleotides in mammals have about ten times the mutation rate of other dinucleotides.

Transposable elements provide a useful tool for comparing the substitution process across different parts of the genome, because they started from multiple identical sequences and the elements observed at present can be placed on a phylogeny with directional substitutions. Singh et al. [23] applied this approach to the 1500 nonfunctional DNAREP1-DM elements in the genomes of Drosophila melanogaster and found enormous heterogeneity in the process of substitution. One of these biases is a recombination-associated GC bias, which can explain the positive correlation between GC content and local recombination rates.
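The relation between observed divergence and substitutions per site under the two simplest models discussed in this section can be sketched as follows. Under the Jukes-Cantor model the distance is d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites; under the two-parameter (Kimura) model it is d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. The code below is a minimal illustration under those standard formulas; the function names and the toy sequences are assumptions made for the example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq_a, seq_b):
    """Return (sites, transitions, transversions) over unambiguous columns.

    Columns containing gaps or non-ACGT symbols are skipped.
    """
    sites = transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        sites += 1
        if x == y:
            continue
        # A transition stays within purines or within pyrimidines.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return sites, transitions, transversions


def jc69_distance(p):
    """Jukes-Cantor distance from the proportion p of differing sites."""
    if p >= 0.75:
        raise ValueError("p must be below 0.75 (saturation)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k2p_distance(P, Q):
    """Two-parameter distance from transition (P) and transversion (Q) proportions."""
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0 or b <= 0:
        raise ValueError("divergence too high for the two-parameter correction")
    return -0.5 * math.log(a * math.sqrt(b))


if __name__ == "__main__":
    # Toy pair: one transition (A/G) and one transversion (A/C) in ten sites.
    n, ts, tv = count_differences("AAAAAAAAAA", "GAAAAAAAAC")
    print(jc69_distance((ts + tv) / n))
    print(k2p_distance(ts / n, tv / n))
```

Both corrections assume a single rate matrix across all sites, which is exactly the assumption that the regional heterogeneity described above calls into question.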
Inference of adaptive evolution of protein-encoding genes

One of the problems in evolutionary analysis that predated genomics and yet provides important insight into genomic analysis is the inference of past adaptive evolution based on DNA sequence. Positive natural selection has been inferred by a variety of approaches, including the relative rates of substitution at synonymous (amino-acid preserving) and nonsynonymous (amino-acid changing) sites [24,25], comparisons between levels of polymorphism and divergence [26,27] and analysis of geographical patterns of variation [28]. All these methods can be applied at a genome-wide level once there is genome-wide alignment of protein-encoding genes, and, more rarely, when there is genome-wide sequence data on multiple individuals to provide polymorphism data (e.g. [29]). One is tempted to think that this means that any microsatellite or single-nucleotide polymorphism (SNP) data, such as the data from the human international Haplotype Map project, could apply, but, unfortunately, most of the methods perform best with full sequence data. Even with complete sequence data, the past demographic changes in the population can result in departures of the patterns of polymorphism (the site frequency spectrum in particular) from the neutral case. Thus, the challenge in doing genome-wide analysis is to make use of the contrasts of many genes embedded in the same population demography, and to make tests of selection that are robust to demographic effects [30,31].

Maximum likelihood methods for estimating rates of divergence and for tests of the neutral null hypothesis have been scaled up to comparisons of protein sequences of human, chimpanzee and mouse [32,33]. This choice of species was not optimal for the test because human and chimpanzee are too closely related, and mouse is too distantly related from primates.
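The comparison between polymorphism and divergence mentioned above can be made concrete with a 2 x 2 contingency table in the style of the McDonald-Kreitman test [27]: fixed versus polymorphic differences, split into nonsynonymous and synonymous classes. The sketch below computes the neutrality index, the derived quantity 1 - NI, and a two-sided Fisher exact p-value. It is a minimal illustration rather than the procedure of any cited paper; the function names and the counts in the usage example are assumptions chosen only to exercise the code.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


def mk_test(dn, ds, pn, ps):
    """Polymorphism/divergence contrast on a 2 x 2 table.

    dn, ds: fixed nonsynonymous and synonymous differences between species.
    pn, ps: nonsynonymous and synonymous polymorphisms within a species.
    """
    if 0 in (dn, ds, ps):
        raise ValueError("dn, ds and ps must be non-zero for the ratio")
    neutrality_index = (pn / ps) / (dn / ds)
    return {
        "neutrality_index": neutrality_index,
        # 1 - NI is often read as the excess of fixed nonsynonymous changes.
        "alpha": 1.0 - neutrality_index,
        "p_value": fisher_exact_two_sided(dn, ds, pn, ps),
    }


if __name__ == "__main__":
    # Illustrative counts only; substitute counts from a real alignment.
    print(mk_test(dn=7, ds=17, pn=2, ps=42))
```

A neutrality index below 1 indicates an excess of fixed nonsynonymous differences relative to polymorphism. As the text notes, demographic history can also distort polymorphism patterns, so a single-gene result of this kind is best interpreted against many genes sampled from the same population.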
By choosing sets of species with evolutionary divergence appropriate for these methods, such as the six or so species most closely related to Drosophila melanogaster, or the plant species in the family Solanaceae [34], these methods will be particularly useful for identifying lineage-specific changes in selective pressures. The methods continue to be refined, especially as we learn about the roles of variation in the nucleotide substitution process, and of the way that changes in demographic history can impact patterns of genetic variation.

One exceptional application of patterns of polymorphism and divergence is to identify particular residues in proteins that are likely to have deleterious consequences by virtue of being evolutionarily invariant across multiple species, but showing a radical change in the mutant in question [35,36]. When one compares the level of conservation across mammals at amino acid residues, those positions that harbor major mendelian disorders in humans tend to be more conserved than are residues that are polymorphic in humans without medical consequence. Similarly, there is some promise for methods that predict alleles of human genes that could be associated with disease, again based only on polymorphism and divergence data [29].

Looking forward: major growth areas in comparative genomics

The first and most obvious trend in comparative genomics that is easy to predict is that the data will continue to pour in at a mind-numbing rate. I wrote this review before the raft of analyses of the dozen Drosophila genome sequences were open for free analysis (http://rana.lbl.gov/drosophila). These genome data will enable any analysis to be done that makes use of a phylogeny on a genome-wide scale. We have seen a beginning to such phylogeny-based analysis in mammals (e.g. [37]), and the power of these approaches is much improved with five and more species [38].
The second most obvious trend is that the tools by which comparative genomics is being done now will see considerable improvement. Methods for whole-genome alignment are currently a crude shadow of what should be possible. Alignment tools should combine the best inference of functional elements, especially protein-encoding genes, and the alignments should be done applying the best evolutionary models for the conservation of those functional elements. Genome browsers for representing multiple species genome-wide alignments are a nice start, but they are cumbersome, error prone and also need radical improvement. We thus need easier ways to visualize and extract comparative sequence information [39]. As the tools are improved, and as more intraspecific genome variation becomes available, the power to make inference about microevolutionary processes will increase, because the most powerful tests of function make use of patterns of polymorphism within species as well as divergence between species.

The best way to connect sequence differences with functional differences is through experimental manipulation of model organisms. I would argue that the best path to understanding how species diverge at a genomic level is to better understand those processes within populations that generate and maintain genomic variation. A massive experiment is being done with humans in the medical genetics community, where studies are assessing associations between variation at phenotypic and genotypic levels by directly testing association in large samples that are measured for medically relevant phenotypes and 500 000 or more SNPs.
Such studies are being carried out by the Wellcome Trust Case-Control Consortium (http://www.wtccc.org.uk) and one of many such efforts sponsored by the US National Institutes of Health is to perform genotyping of ~500 000 SNPs in the Framingham Heart Study (http://www.framingham.com/heart/) for purposes of identifying genetic variation associated with elevated risk for heart disease. This kind of genome-wide assessment of polymorphism provides unprecedented power with which to quantify variation and make inferences about past demographic and selective forces. The efforts in medical applications have driven the cost of genotyping down far enough that it seems inevitable that this kind of data will become available for a variety of organisms. There is a huge challenge not only in designing the analytical procedures for making sensible statistical inference from such data, but also in optimizing the experimental design in the first place.

An understanding of the major revolutions in evolution demands methods of analysis that can deal with gross changes, such as radical reworking of body plans [40]. The best way to understand how development changes through evolution is to first acquire a better understanding of the gene regulatory networks within a model species. For example, the extraordinary engineering of the endo16 regulatory network in sea urchins took years of work for Eric Davidson’s group to tease apart [41], and now a comparative analysis of genetic regulation of early development in the sea star is revealing many significant parallels [42]. The future for evolutionary developmental genetics is certainly bright.

Twenty years spanning the origin of genomics

The idea of sequencing the full genome of multiple organisms could have been on the minds of some science fiction writers in 1986, but most of the readers of the first issue of TREE would have been stunned to know that, in just 20 years, genomics would have progressed as far as it has.
Although there were a few complete genome sequences known in 1986, including the complete human mitochondrial genome sequence [43], the richness of the questions that can be pursued with whole-genome data continues to inspire cohorts of students and researchers. Much of this is driven by a combination of computational rigor and novel statistical modeling. The first methods to estimate rates of substitution at synonymous and nonsynonymous nucleotide positions [44] were published just one year prior to the inaugural issue of TREE. This has grown to a massive enterprise that is now applied routinely to whole genomes’ worth of encoding sequences to identify genes that appear to exhibit accelerated protein evolution.

Many questions that were prominent in evolutionary genetics before 1986 remain so, and have been enriched by genomic data. The idea of understanding speciation by experimentally determining the genetic basis for interspecific hybrid sterility and inviability was an active field and is now one that flourishes anew through tools that enable the determination of the individual genes that have large effects on hybrid performance [45–47]. The depth of our understanding of hybrid gene function has been greatly expanded by being able to query the transcript abundance of almost the entire genome [48], and dysregulation of gene networks in hybrids provides a unique view of how those networks evolve in the first place.

As well as the great intellectual satisfaction that is coming from discoveries in comparative genomics, it is a branch of evolutionary biology that also has enormous practical benefit. Perhaps the most obvious is that the most rapid way to understand which aspects of a genome have functional roles is through patterns of sequence conservation, and our ability to annotate functional elements of our own genome relies almost entirely on this approach [20].
But the same could be said not only of the genomes of model organisms, but about the genomic basis for phenotypes of crucial pathogenic importance, such as the transmission of Plasmodium by Anopheles mosquitoes [49,50]. Dobzhansky’s claim is overused, but one thing that genomic science has made patently obvious is the essential role of evolutionary thinking in making sense of biology.

References

1 Cheng, Z. et al. (2005) A genome-wide comparison of recent chimpanzee and human segmental duplications. Nature 437, 88–93
2 Petrov, D.A. et al. (2000) Evidence for DNA loss as a determinant of genome size. Science 287, 1060–1062
3 Wolfe, K.H. et al. (1992) Function and evolution of a minimal plastid genome from a nonphotosynthetic parasitic plant. Proc. Natl. Acad. Sci. U. S. A. 89, 10648–10652
4 Moran, N.A. (2003) Tracing the evolution of gene loss in obligate bacterial symbionts. Curr. Opin. Microbiol. 6, 512–518
5 Mower, J.P. et al. (2004) Plant genetics: gene transfer from parasitic to host plants. Nature 432, 165–166
6 Lolle, S.J. et al. (2005) Genome-wide non-mendelian inheritance of extra-genomic information in Arabidopsis. Nature 434, 505–509
7 Malik, H.S. et al. (2002) Recurrent evolution of DNA-binding motifs in the Drosophila centromeric histone. Proc. Natl. Acad. Sci. U. S. A. 99, 1449–1454
8 Richly, E. and Leister, D. (2004) NUMTs in sequenced eukaryotic genomes. Mol. Biol. Evol. 21, 1081–1084
9 Waterston, R.H. et al. (2002) Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562
10 Dermitzakis, E.T. et al. (2002) Numerous potentially functional but non-genic conserved sequences on human chromosome 21. Nature 420, 578–582
11 Dermitzakis, E.T. et al. (2003) Evolutionary discrimination of mammalian conserved non-genic sequences (CNGs). Science 302, 1033–1035
12 Bejerano, G. et al. (2004) Ultraconserved elements in the human genome. Science 304, 1321–1325
13 Siepel, A. et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050
14 Drake, J.A. et al. (2006) Conserved noncoding sequences are selectively constrained and not mutation cold spots. Nat. Genet. 38, 223–227
15 Lall, S. et al. (2006) A genome-wide map of conserved microRNA targets in C. elegans. Curr. Biol. 16, 460–471
16 Lunter, G. et al. (2006) Genome-wide identification of human functional DNA using a neutral indel model. PLoS Comput. Biol. 2, e5
17 Korf, I. et al. (2001) Integrating genomic homology into gene structure prediction. Bioinformatics 17(Suppl. 1), S140–S148
18 Wray, G.A. et al. (2003) The evolution of transcriptional regulation in eukaryotes. Mol. Biol. Evol. 20, 1377–1419
19 Boffelli, D. et al. (2003) Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299, 1391–1394
20 ENCODE Project Consortium. (2004) The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306, 636–640
21 Jukes, T.H. and Cantor, C.R. (1969) Evolution of protein molecules. In Mammalian Protein Metabolism III (Munro, H., ed.), pp. 21–132, Academic Press
22 Karlin, S. and Mrazek, J. (1997) Compositional differences within and between eukaryotic genomes. Proc. Natl. Acad. Sci. U. S. A. 94, 10227–10232
23 Singh, N.D. et al. (2005) Genomic heterogeneity of background substitutional patterns in Drosophila melanogaster. Genetics 169, 709–722
24 Hughes, A.L. and Nei, M. (1988) Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature 335, 167–170
25 Nielsen, R. and Yang, Z. (1998) Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148, 929–936
26 Hudson, R.R. et al. (1987) A test of neutral molecular evolution based on nucleotide data. Genetics 116, 153–159
27 McDonald, J.H. and Kreitman, M. (1991) Adaptive protein evolution at the Adh locus in Drosophila. Nature 351, 652–654
28 Akey, J.M. et al. (2004) Population history and natural selection shape patterns of genetic variation in 132 genes. PLoS Biol. 2, e286
29 Bustamante, C.D. et al. (2005) Natural selection on protein-coding genes in the human genome. Nature 437, 1153–1157
30 Jensen, J.D. et al. (2005) Distinguishing between selective sweeps and demography using DNA polymorphism data. Genetics 170, 1401–1410
31 Wright, S.I. et al. (2005) The effects of artificial selection on the maize genome. Science 308, 1310–1314
32 Clark, A.G. et al. (2005) Ascertainment bias in studies of human genome-wide polymorphism. Genome Res. 15, 1496–1502
33 Nielsen, R. et al. (2005) A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS Biol. 3, 976–985
34 Mueller, L.A. et al. (2005) The SOL Genomics Network: a comparative resource for Solanaceae biology and beyond. Plant Physiol. 138, 1310–1317
35 Sunyaev, S. et al. (2001) Prediction of deleterious human alleles. Hum. Mol. Genet. 10, 591–597
36 Ng, P.C. and Henikoff, S. (2003) SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res. 31, 3812–3814
37 Clark, A.G. et al. (2003) Inferring nonneutral evolution from human–chimp–mouse orthologous gene trios. Science 302, 1960–1963
38 Wong, W.S. et al. (2004) Accuracy and power of statistical methods for detecting adaptive evolution in protein coding sequences and for identifying positively selected sites. Genetics 168, 1041–1051
39 Miller, W. et al. (2004) Comparative genomics. Annu. Rev. Genomics Hum. Genet. 5, 15–56
40 Angelini, D.R. and Kaufman, T.C. (2005) Comparative developmental genetics and the evolution of arthropod body plans. Annu. Rev. Genet. 39, 95–119
41 Yuh, C.H. and Davidson, E.H. (1996) Modular cis-regulatory organization of Endo16, a gut-specific gene of the sea urchin embryo. Development 122, 1069–1082
42 Otim, O. et al.
(2005) Expression of AmHNF6, a sea star orthologue of a transcription factor with multiple distinct roles in sea urchin development. Gene Expr. Patt. 5, 381–386 43 Anderson, S. et al. (1981) Sequence and organization of the human mitochondrial genome. Nature 290, 457–465 44 Li, W-H. et al. (1985) A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 2, 150–174 45 Presgraves, D.C. et al. (2003) Adaptive evolution drives divergence of a hybrid inviability gene between two species of Drosophila. Nature 423, 715–719 46 Barbash, D.A. et al. (2003) A rapidly evolving MYB-related protein causes species isolation in Drosophila. Proc. Natl. Acad. Sci. U. S. A. 100, 5302–5307 47 Sun, S. et al. (2004) The normal function of a speciation gene, Odysseus, and its hybrid sterility effect. Science 305, 81–83 48 Ranz, J.M. et al. (2004) Anomalies in the expression profile of interspecific hybrids of Drosophila melanogaster and Drosophila simulans. Genome Res. 14, 373–379 49 Richards, S. et al. (2005) Comparative genome sequencing of Drosophila pseudoobscura: chromosomal, gene, and cis-element evolution. Genome Res. 15, 1–18 50 Christophides, G.K. et al. (2004) Comparative and functional genomics of the innate immune system in the malaria vector Anopheles gambiae. Immunol. Rev. 198, 127–148 AGORA initiative provides free agriculture journals to developing countries The Health Internetwork Access to Research Initiative (HINARI) of the WHO has launched a new community scheme with the UN Food and Agriculture Organization. As part of this enterprise, Elsevier has given 185 journals to Access to Global Online Research in Agriculture (AGORA). 
More than 100 institutions are now registered for the scheme, which aims to provide developing countries with free access to vital research that will ultimately help increase crop yields and encourage agricultural self-sufficiency. According to the Africa University in Zimbabwe, AGORA has been welcomed by both students and staff. ‘It has brought a wealth of information to our fingertips’ says Vimbai Hungwe. ‘The information made available goes a long way in helping the learning, teaching and research activities within the University. Given the economic hardships we are going through, it couldn’t have come at a better time.’ For more information visit: http://www.healthinternetwork.net www.sciencedirect.com