Signals of natural selection in the HapMap project data
The International HapMap Consortium
Gil McVean
Department of Statistics, Oxford University

The International HapMap Project
• To facilitate the design and analysis of association studies
• A genome-wide map of genetic variation across 270 individuals from four populations
  – CEPH families from Utah
  – Yoruba from Nigeria
  – Han Chinese from Beijing
  – Japanese from the Tokyo region
• Phase I collected data on approximately 1.2 million SNPs
• Phase II increases SNP density to more than one per kb
• All data publicly available at www.hapmap.org

Looking for selection
• A genome-wide map of variation can also be used to hunt for regions of the genome where natural selection has acted
  – Selective sweeps
  – Balancing selection
  – Local adaptation
• Why?
  – Interest
  – Functional polymorphism
  – The signal of selection we observe tells us about the genetic architecture of traits

Methods for mapping selection
• Model-based
  – Compare genetic variation to ‘neutral’ model
• Purely empirical
  – Consider the ‘most extreme’ genomic regions
• ‘Calibrated’
  – Compare to examples of (very few) proven selective importance

In what way are selected regions unusual? (in the HapMap data)
• HLA
• 17q21 inversion
• Lactase
• Duffy

HLA and resistance to infectious disease
• The HLA region shows extremely high levels of polymorphism

17q21 inversion and reproductive success
• The inversion has multiple (66) SNPs in perfect association (r² = 1)

LCT and lactase persistence
• The LCT gene shows an extended haplotype structure in European populations

The Duffy locus and resistance to Plasmodium vivax
• The FY gene shows extreme population differentiation

Different selective histories leave different footprints in genetic variation

How much of the genome looks as ‘unusual’ as these selected loci?
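Three of the footprints listed above (high polymorphism, perfect association between SNPs, and population differentiation) can be summarised by simple per-SNP statistics. The sketch below is illustrative only and is not the consortium's pipeline: the function names are hypothetical, the input is assumed to be phased 0/1 haplotypes, and a basic two-population FST stands in for "differentiation" because the slides do not say which measure was used. EHH is omitted because it needs longer haplotype data.

```python
def heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic SNP with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def r_squared(hap_a, hap_b):
    """r^2 between two biallelic SNPs, given 0/1 alleles on the same phased haplotypes."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    # Frequency of haplotypes carrying allele 1 at both SNPs.
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        return float("nan")  # undefined when either SNP is monomorphic
    return d * d / denom


def fst(p1, p2):
    """Simple FST = (H_T - H_S) / H_T for two populations (assumes equal weights)."""
    h_total = heterozygosity((p1 + p2) / 2.0)
    h_within = (heterozygosity(p1) + heterozygosity(p2)) / 2.0
    if h_total == 0.0:
        return 0.0  # the site is fixed for the same allele in both populations
    return (h_total - h_within) / h_total
```

Two SNPs are "perfect proxies" in the sense of the 17q21 slide when `r_squared` returns 1, and a site fixed for different alleles in two populations gives `fst` equal to 1.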
Heterozygosity as extreme as HLA
(Figure: genome-wide distribution, HLA marked)

Sets of perfect proxies as extreme as the 17q21 inversion
(Figure: genome-wide distribution, inversion marked)

EHH as extreme as LCT
(Figure: genome-wide distribution, lactase marked)

Differentiation as extreme as the Duffy locus (NB not FY*O)
(Figure: genome-wide distribution, Duffy marked)

For 3 of the 4 cases, the selected locus is at the very extreme of the genome-wide distribution

What can we learn from the unusual, but less extreme cases?

Heterozygosity across the genome
(Figure legend: Top 1%, Top 5%, Top 10%, Bottom 10%, Bottom 5%, Bottom 1%)

Elevated heterozygosity on 8p
(Figure panels: Chromosome 6, MHC; Chromosome 8, 8p23 inversion)

Distribution of long runs of perfect proxies
(Figure legend: ≥ 50 SNPs, 20–50 SNPs, 10–20 SNPs; 17q21 inversion marked)

An inversion on the X chromosome?

Distribution of EHH
(Figure legend: Top 0.1%, Top 1%, Top 10%)

A selective sweep on chromosome 5?

Distribution of differentiation
(Figure legend: Top 0.1%, Top 1%, Top 10%)

SLC24A5
• Lamason et al (Science 2005)

Unusual regions of the genome suggest interesting biology
BUT
The hypothesis of historical selection is fundamentally untestable

What hypothesis can we test?
• Signals of selection should tend to occur near regions of known functional importance, i.e. genes

Are genes over-represented in regions of high heterozygosity?

Are genes over-represented in regions of high proxy number?

Are genes over-represented in regions of high EHH?

Are genes over-represented in regions of high differentiation?

Only differentiation shows a tendency for an increased density of ‘selection’ near genes

The wild speculation: selection on standing variation
• Why should we see an excess of one type of signal of adaptive evolution near genes, but not another?
• Perhaps the signals are sensitive to assumptions about how selection occurs?
• EHH methods will be most powerful for identifying selection on a single, novel mutation
• Differentiation will pick out cases where an already polymorphic mutation, present on multiple haplotype backgrounds, becomes favoured in one geographic region
• Perhaps most selection has been on standing variation?
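The gene over-representation questions above could be checked with a simple tail-enrichment calculation. The sketch below is one possible approach and is not taken from the slides: it assumes the genome has been cut into windows, each with a score (heterozygosity, proxy number, EHH or differentiation) and a flag saying whether the window overlaps a gene. All names and the `top_fraction` parameter are hypothetical.

```python
import random


def tail_enrichment(scores, is_genic, top_fraction=0.01):
    """Genic fraction among the top-scoring windows divided by the genic fraction overall."""
    n = len(scores)
    k = max(1, int(round(n * top_fraction)))
    # Indices of the k highest-scoring windows.
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    frac_top = sum(1 for i in top if is_genic[i]) / k
    frac_all = sum(1 for g in is_genic if g) / n
    if frac_all == 0.0:
        raise ValueError("no genic windows; enrichment is undefined")
    return frac_top / frac_all


def permutation_p_value(scores, is_genic, top_fraction=0.01, n_perm=1000, seed=0):
    """One-sided p-value from shuffling the genic labels across windows."""
    rng = random.Random(seed)
    observed = tail_enrichment(scores, is_genic, top_fraction)
    labels = list(is_genic)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if tail_enrichment(scores, labels, top_fraction) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Shuffling labels treats windows as independent, which ignores the clustering of genes and the correlation between neighbouring windows, so a p-value from this sketch would be optimistic on real data.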
Acknowledgements
• The International HapMap Consortium
• Oxford Statistics
  – Peter Donnelly, Simon Myers, Chris Spencer, Raphaelle Chaix
• Funding agencies
  – NIH, TSC, The Wellcome Trust, BBSRC, the Fyssen Foundation

Distribution of Fay and Wu’s H statistic
(Figure legend: Bottom 0.1%, Bottom 1%, Bottom 10%)

Distribution of Tajima D statistic
(Figure legend: Top 1%, Top 5%, Top 10%, Bottom 10%, Bottom 5%, Bottom 1%)

Tajima D (negative), Fay and Wu H (negative)

Numbers of SNPs
Columns: chromosome; #SNPs in common files; #SNPs QC’ed, polymorphic and with ancestral inferred; percent converted (given as a fraction); chromosome length (bp); approx. SNP spacing (kb)

Chromosome   Common    QC’ed     Converted   Length (bp)   Spacing (kb)
1            75850     64107     0.8451813   246043912     3.84
2            82565     74829     0.9063041   243407499     3.25
3            59417     52523     0.8839726   199282781     3.79
4            53219     47878     0.8996411   191710711     4.00
5            53324     48504     0.9096092   180825316     3.73
6            61829     55344     0.8951139   170902878     3.09
7            42588     35240     0.8274631   158542415     4.50
8            65506     60306     0.920618    146305419     2.43
9            51906     47285     0.9109737   136326725     2.88
10           46073     41185     0.8939075   135035657     3.28
11           41299     36687     0.8883266   134481573     3.67
12           38433     34895     0.9079437   132017602     3.78
13           33757     30779     0.9117813   113025098     3.67
14           27143     24487     0.9021479   105260053     4.30
15           24615     22124     0.8988015   100133324     4.53
16           23400     20779     0.8879915   89915381      4.33
17           23235     20576     0.8855606   81724082      3.97
18           35931     33137     0.9222398   76114138      2.30
19           16505     14246     0.8631324   63788762      4.48
20           19275     15700     0.8145266   63686957      4.06
21           17933     16281     0.9078793   46956357      2.88
22           17244     15196     0.8812341   49375569      3.25
X PAR 1      408       5         0.0122549   2689596       537.92
X non PAR    53594     41682     0.7777363   150671647     3.61
X PAR 2      45        0         0           328507        NA
Totals       965094    853775    0.8846548   3018551959    3.54
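The two derived columns of the SNP table above appear to follow from the other three: the converted fraction is the QC’ed count divided by the count in the common files, and the approximate spacing is the chromosome length divided by the QC’ed count, expressed in kb. The sketch below checks that reading against a few rows; the function names are hypothetical and the kb unit is inferred from the numbers, not stated in the source.

```python
def fraction_converted(n_common, n_qc):
    """Share of SNPs in the common files that passed QC (QC'ed / common)."""
    return n_qc / n_common


def snp_spacing_kb(length_bp, n_qc):
    """Average spacing between QC'ed SNPs in kb, or None when there are no SNPs."""
    if n_qc == 0:
        return None  # matches the "NA" entry for X PAR 2
    return length_bp / n_qc / 1000.0


# Rows copied from the table: (chromosome, common, QC'ed, length in bp).
rows = [
    ("1", 75850, 64107, 246043912),
    ("X PAR 1", 408, 5, 2689596),
    ("X PAR 2", 45, 0, 328507),
    ("Totals", 965094, 853775, 3018551959),
]

for name, n_common, n_qc, length_bp in rows:
    spacing = snp_spacing_kb(length_bp, n_qc)
    spacing_text = "NA" if spacing is None else f"{spacing:.2f}"
    print(name, f"{fraction_converted(n_common, n_qc):.7f}", spacing_text)
```

On this reading, the Phase I data average roughly one usable SNP every 3.5 kb across the genome.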