bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.

The population genetics of human disease: the case of recessive, lethal mutations

Carlos Eduardo G. Amorim1,2*, Ziyue Gao3, Zachary Baker4, José Francisco Diesel5, Yuval B. Simons1, Imran S. Haque6, Joseph Pickrell1,7+, Molly Przeworski1,4+

1 Department of Biological Sciences, Columbia University, New York, NY 10027
2 CAPES Foundation, Ministry of Education of Brazil, Brasília, DF, Brazil 70040
3 Howard Hughes Medical Institution, Stanford University, Stanford, CA 94305
4 Department of Systems Biology, Columbia University, New York, NY 10027
5 Universidade Federal de Santa Maria, Santa Maria, RS, Brazil 97105
6 Counsyl, 180 Kimball Way, South San Francisco, CA
7 New York Genome Center, New York, NY 10013
+ These authors co-supervised this work
* To whom correspondence should be addressed ([email protected])

Abstract

Do the frequencies of disease mutations in human populations reflect a simple balance between mutation and purifying selection? What other factors shape the prevalence of disease mutations? To begin to answer these questions, we focused on one of the simplest cases: recessive mutations that alone cause lethal diseases or complete sterility. To this end, we generated a hand-curated set of 417 Mendelian mutations in 32 genes, reported to cause a recessive, lethal Mendelian disease. We then considered analytic models of mutation-selection balance in infinite and finite populations of constant sizes and simulations of purifying selection in a more realistic demographic setting, and tested how well these models fit allele frequencies estimated from 33,370 individuals of European ancestry.
In doing so, we distinguished between CpG transitions, which occur at a substantially elevated rate, and other mutation types. Whereas observed frequencies for CpG transitions are close to expectation, the frequencies observed for other mutation types are an order of magnitude higher than expected; this discrepancy is even larger when subtle fitness effects in heterozygotes or lethal compound heterozygotes are taken into account. In principle, higher than expected frequencies of disease mutations could be due to widespread errors in reporting causal variants, compensation by other mutations, or balancing selection. It is unclear why these factors would affect CpG transitions differently from other mutations, however. We argue instead that the unexpectedly high frequency of non-CpGti disease mutations likely reflects an ascertainment bias: of all the mutations that cause recessive lethal diseases, those that by chance have reached higher frequencies are more likely to have been identified in medical studies and thus to have been included in this study. Beyond the specific application, this study highlights the parameters likely to be important in shaping the frequencies of Mendelian disease alleles.

Author Summary

What determines the frequencies of disease mutations in human populations?
To begin to answer this question, we focus on one of the simplest cases: mutations that cause completely recessive, Mendelian disease. We first review theory about what to expect from mutation and selection in a population of finite size and further generate predictions based on simulations using a realistic demographic scenario of human evolution. For a highly mutable type of mutation, involving transitions at CpG sites, we find that the predictions fit observed frequencies of recessive lethal disease mutations well. For less mutable types, however, predictions tend to underestimate the observed frequency. We discuss possible explanations for the discrepancy and point to a complication that, to our knowledge, is not widely appreciated: that there exists ascertainment bias in disease mutation discovery. Specifically, we suggest that alleles that have been identified to date are likely the ones that by chance have drifted to higher frequencies and are thus more likely to have been mapped. More generally, our study highlights the parameters that influence the frequencies of Mendelian disease alleles, helping to interpret the relevance of variants of unknown significance based on their allele frequencies.

Introduction

New disease mutations arise in heterozygotes and either drift to higher frequencies or are rapidly purged from the population, depending on the strength of selection and the demographic history of the population [1-6]. Elucidating the relative contributions of mutation, natural selection and genetic drift will help to understand why disease alleles persist in humans. Answers to these questions are also of practical importance, in informing how genetic variation data can be used to identify additional disease mutations [7].
In this regard, rare, Mendelian diseases, which are caused by single highly penetrant and deleterious alleles, are perhaps most amenable to investigation. A simple model for the persistence of mutations that lead to Mendelian diseases is that their frequency is determined by a balance between their introduction by mutation and elimination by purifying selection, i.e., that we expect to find them at “mutation-selection balance” (MSB) [4]. In finite populations, random drift leads to stochastic changes in the frequency of any mutation, so demographic history, in addition to mutation and natural selection, plays an important role in shaping the frequency distribution of deleterious mutations [3].

Another factor that may be important in shaping the frequency of highly penetrant disease mutations is genetic interactions. For instance, a disease mutation may be rescued by another mutation in the same gene [8-10] or by a modifier locus elsewhere in the genome that modulates the severity of the disease symptoms or the penetrance of the disease allele (e.g. [11-13]).

For a subset of disease alleles that are recessive, an alternative model for their frequency in the population is that there is an advantage to carrying one copy but a disadvantage to carrying two or none, such that the alleles persist due to overdominance, a form of balancing selection. Well known examples include sickle cell anemia, thalassemia and G6PD deficiency in populations living where malaria is endemic [14]. Beyond these cases, the importance of overdominance in maintaining the high frequency of disease mutations is unknown.

Here, we tested hypotheses about the persistence of mutations that cause lethal, recessive, Mendelian disorders.
This case provides a good starting point towards elucidating the determinants of disease mutations in human populations, because a large number of recessive lethal disorders have been mapped (e.g., genes have already been associated with >66% of Mendelian disease phenotypes; [15]) and, while the fitness effects of most diseases are hard to estimate, for lethal diseases, the selection coefficient is clearly 1 for homozygote carriers (at least in the absence of modern medical care). Moreover, the simplest expectation for mutation-selection balance would suggest that, given a per base pair mutation rate on the order of $10^{-8}$ per generation [16], the frequency of such alleles would be $\sqrt{u}$, i.e., ~$10^{-4}$ [4]. Thus, sample sizes in human genetics are now sufficiently large that we should be able to observe recessive disease alleles segregating in heterozygotes.

To this end, we compiled genetic information for a set of 417 mutations reported to cause fatal, recessive Mendelian diseases and estimated the frequencies of the disease-causing alleles from large exome datasets. We then compared these data to the expected frequencies of deleterious alleles based on models of MSB in order to evaluate the effects of demography and other mechanisms in influencing these frequencies.

Results

Mendelian recessive disease allele set

We relied on two datasets, one that describes 173 autosomal recessive diseases [17] and another from a genetic testing laboratory (Counsyl; <https://www.counsyl.com/>) that includes 110 recessive diseases of clinical interest.
From these lists, we obtained a set of 44 “recessive lethal” diseases associated with 42 genes (Table S1) requiring that at least one of the following conditions is met: (i) in the absence of treatment, the affected individuals die of the disease before reproductive age, (ii) reproduction is completely impaired in patients of both sexes, (iii) the phenotype includes severe mental retardation that in practice precludes reproduction, or (iv) the phenotype includes severely compromised physical development, again precluding reproduction. Based on clinical genetics datasets and the medical literature (see Methods for details), we were able to confirm that 417 Single Nucleotide Variants (SNVs) in 32 (of the 42) genes had been reported as associated with the severe, early-onset form of the corresponding disease, with no indication of incomplete penetrance or effects in heterozygote carriers (Table S2). By this approach, we obtained a set of mutations for which, at least in principle, there is no heterozygote effect (i.e., the dominance coefficient h=0 in a model with genotype fitnesses 1, 1-hs and 1-s) and the selective coefficient s for the recessive allele is 1.

A large subset of these mutations (29.3%) consists of transitions at CpG sites (henceforth CpGti), which occur at a highly elevated rate (~17-fold higher) compared to other mutation types, namely CpG transversions, and non-CpG transitions and transversions [16]. This proportion is in agreement with Cooper and Youssoufian [18], who found that ~31.5% of disease mutations in exons are CpGti.
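The four mutation classes named above (CpG transitions, CpG transversions, and non-CpG transitions and transversions) can be assigned from the reference base, the alternate base, and the two flanking bases. The following is a minimal sketch under stated assumptions, not the authors' in-house script: the function name `classify_snv` and its argument order are hypothetical, bases are given on the forward strand, and methylation status (e.g., whether the site lies in a CpG island) is not considered.

```python
# Hypothetical helper for illustration; not the authors' in-house script.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def classify_snv(left, ref, alt, right):
    """Assign an SNV to CpGti, CpGtv, nonCpGti or nonCpGtv.

    left and right are the forward-strand bases flanking the reference base.
    """
    left, ref, alt, right = (b.upper() for b in (left, ref, alt, right))
    if ref == alt or not {left, ref, alt, right} <= set("ACGT"):
        raise ValueError("expected distinct ref/alt bases drawn from A, C, G, T")
    # A CpG site is a C followed by a G, or (seen from the other strand)
    # a G preceded by a C.
    in_cpg = (ref == "C" and right == "G") or (ref == "G" and left == "C")
    is_transition = (ref, alt) in TRANSITIONS
    prefix = "CpG" if in_cpg else "nonCpG"
    return prefix + ("ti" if is_transition else "tv")
```

For example, `classify_snv("A", "C", "T", "G")` describes a C-to-T change at a C followed by G and is therefore labeled as a CpG transition.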
Empirical distribution of disease alleles in Europe

Allele frequency data for the 417 variants were obtained from the Exome Aggregation Consortium (ExAC) for 60,706 individuals [19]. Out of the 417 variants associated with putative recessive lethal diseases, three were found homozygous in at least one individual in this dataset (rs35269064, p.Arg108Leu in ASS1; rs28933375, p.Asn252Ser in PRF1; and rs113857788, p.Gln1352His in CFTR). Available data quality information for these variants does not suggest that these genotypes are due to calling artifacts (Table S2). Since these diseases have severe symptoms that lead to early death without treatment and these ExAC individuals are purportedly healthy (i.e., do not manifest severe diseases) [19], the reported mutations are likely errors in pathogenicity classification or cases of incomplete penetrance (see a similar observation for CFTR and DHCR7 in [20]). We therefore excluded them from our analyses. In addition to the mutations present in homozygotes, we also filtered out sites that had lower coverage in ExAC (see Methods), resulting in a final dataset of 385 variants in 32 genes (Table S2).

Genotypes for a subset (N=91) of these mutations were also available for a larger sample size (76,314 individuals with self-reported European ancestry) generated by the company Counsyl (Table S3). A comparison of the allele frequencies in this larger dataset to that of ExAC data [19] showed excellent concordance (Wilcoxon signed-rank test for paired samples, p-value=0.34; Fig S1). Thus, both data sets appear to accurately reflect European frequencies of these disease alleles, despite slight differences in the populations included. In what follows, we focus on ExAC, which includes a greater number of disease mutations.
Models of mutation-selection balance

To generate expectations for the frequencies of these disease mutations under mutation-selection balance, we considered models of infinite and finite populations of constant size [3], and conducted forward simulations using a plausible demographic model for African and European populations [21] (see Methods for details). In the models, there is a wild-type allele (A) and a deleterious allele (a, which could also represent a class of distinct deleterious alleles with the same fitness effect) at each site, such that the relative fitness of individuals of genotypes AA, Aa, or aa is given by:

• $w_{AA} = 1$;
• $w_{Aa} = 1 - hs$;
• $w_{aa} = 1 - s$.

The mutation rate from A to a is u and there are no back mutations.

For a constant population of infinite size, Wright [22] showed that under these conditions, there exists a stable equilibrium between mutation and selection, when the selection pressure is sufficiently strong (s>>u). In particular, when the deleterious effect of allele a is completely recessive (h=0), its equilibrium frequency q is given by:

$$q = \sqrt{u/s}. \tag{1}$$

For a finite population of constant size, Nei [3] derived the mean (eq. 2) and variance (eq. 3) of the frequency of a fully recessive deleterious mutation (h=0) based on a diffusion model, leading to:

$$\bar{q} = \frac{\Gamma(2Nu + 1/2)}{\sqrt{2Ns}\,\Gamma(2Nu)}, \tag{2}$$

$$\sigma_q^2 = u/s - \bar{q}^2, \tag{3}$$

where N is the diploid population size and Γ is the gamma function (see Simons et al. [1] for a similar approximation).
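Equations (1) to (3) are straightforward to evaluate numerically. The sketch below is an illustration rather than the authors' code: the function names are hypothetical, and the parameter values in the usage lines (N = 20,000, u = 1.2e-8, s = 1) are chosen here only as plausible human-scale inputs. The ratio of gamma functions in eq. (2) is computed through `math.lgamma` to avoid overflow when 2Nu is large.

```python
import math


def q_infinite(u, s=1.0):
    """Eq. (1): equilibrium frequency of a fully recessive allele, infinite population."""
    return math.sqrt(u / s)


def q_finite_mean(N, u, s=1.0):
    """Eq. (2): mean frequency in a finite population of diploid size N."""
    x = 2.0 * N * u
    # Gamma(x + 1/2) / Gamma(x), computed on the log scale for stability.
    log_ratio = math.lgamma(x + 0.5) - math.lgamma(x)
    return math.exp(log_ratio) / math.sqrt(2.0 * N * s)


def q_finite_var(N, u, s=1.0):
    """Eq. (3): variance of the frequency in a finite population."""
    return u / s - q_finite_mean(N, u, s) ** 2


# Illustrative inputs (assumptions for this sketch).
N, u = 20_000, 1.2e-8
print("infinite-population q:", q_infinite(u))
print("finite-population mean:", q_finite_mean(N, u))
print("finite-population variance:", q_finite_var(N, u))
```

With these inputs the finite-population mean is far smaller than the infinite-population value, which illustrates how strongly the assumption about population size can affect the expectation.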
In a finite population, the mean frequency, $\bar{q}$, therefore depends on assumptions about the population mutation rate (2Nu). If the population mutation rate is high, such that 2Nu>>1, $\bar{q}$ is approximated by

$$\bar{q} \approx \sqrt{u/s}, \tag{4}$$

which is independent of the population size and equal to the frequency expected in an infinite size population; in particular, for a recessive lethal mutation, the mean frequency is equal to the right hand side of eq. (1). The important difference between models is that in a finite population, there is a distribution of frequencies q (because of genetic drift), rather than a single value, whose variance is given in eq. (3).

In contrast, when the finite population has a low population mutation rate (2Nu<<1), the mean allele frequency, $\bar{q}$, is approximated by:

$$\bar{q} \approx u\sqrt{2\pi N/s}, \tag{5}$$

which depends on the population size. In humans, the mutation rate at each base pair is very small (on the order of $10^{-8}$), so the second approximation should apply when considering each single site independently. The expectation and variance of the frequency of a fatal, fully recessive allele (i.e., s=1, h=0) are then given by:

$$\bar{q} = u\sqrt{2\pi N}, \tag{6}$$

and

$$\sigma_q^2 = u - \bar{q}^2 = u(1 - 2\pi N u) \approx u. \tag{7}$$

All models assume complete penetrance, complete recessivity (h = 0), lethality of homozygous mutations (s = 1) and consider a single site, thereby ignoring the possibility of compound heterozygosity (unless otherwise noted).

Comparing mutation-selection balance models

Although an infinite population size has often been assumed when modeling deleterious allele frequencies (e.g. [5,23-26]), predictions under this assumption can differ markedly from what is expected from models of finite population sizes, assuming plausible parameter values for humans.
For example, the long-term estimate of the effective population size from total polymorphism levels is ~20,000 individuals (assuming a per base pair mutation rate of $1.2 \times 10^{-8}$ [16] and diversity levels of 0.1% [27]). In this case, the average deleterious allele frequency in the model of finite population size is >25-fold lower than that in the infinite population model (Fig 1).

Because the human population size has not been constant, and changes in the population size can affect the frequencies of deleterious alleles in the population (as recently reviewed by Simons and Sella [2]), we also simulated mutation-selection balance under a plausible demographic model for the population history of European populations [21]. With our parameters, the mean allele frequency obtained from this model was $8.1 \times 10^{-6}$, ~2-fold higher than expected for a constant population size model with Ne=20,000 (Fig 1). The mean frequency seen in simulations instead matches the expectation for a constant population size of ~72k individuals (see Methods and Fig S2a). This finding is expected: the estimate of 20,000 is based on total genetic variation; assuming that most of this variation is neutral, it therefore reflects an average timescale over millions of years. For recessive lethal mutations, however, which are relatively rapidly purged by natural selection, a more recent time depth is relevant (e.g., [1]). Indeed, in our simulations, most of the disease alleles (65.6%) segregating at present arose very recently, such that they were not segregating in the population 205 generations ago, the time point after which Ne is estimated to have increased from 9,300 to 512,000 individuals [21]. Increasing the effective population size is not enough to model disease alleles appropriately, however.
For example, if we compare simulation results obtained under the more complex Tennessen et al. [21] demographic model to those for simulations of a constant population size of Ne=72,348, the mean allele frequencies match, but the distributions of allele frequencies are significantly different (Fig S2b). These findings thus confirm the importance of incorporating demographic history into models for understanding the population dynamics of disease alleles [5,28,29]. In what follows, we therefore test the fit of the results based on a more realistic demographic model [21] to the observed allele frequencies.

Comparing empirical and expected distributions of disease alleles

The mutation rate, u, from wild-type allele to disease allele is a critical parameter in predicting the frequencies of a deleterious allele [4,30]. To model disease alleles, we considered four mutation types separately, with the goal of capturing most of the fine-scale heterogeneity in mutation rates [31-34]: transitions in methylated CpG sites (CpGti) and three less mutable types, namely transversions in CpG sites (CpGtv) and transitions and transversions outside a CpG site (nonCpGti and nonCpGtv, respectively). In order to control for the methylation status of CpG sites, which has an important influence on transition rates [31], we excluded 12 CpGti that occurred in CpG islands, which tend not to be methylated (following Moorjani et al. [35]). To allow for heterogeneity in mutation rates within each one of these four classes considered, we modeled the within-class variation in mutation rates according to a lognormal distribution (see details in Methods and [33]).

For each mutation type, we then compared results from the simulations to what is observed in ExAC, focusing on the largest sample of the same common ancestry, namely Non-Finnish Europeans (N = 33,370) (Fig 2).
We find significant differences between empirical and expected mean frequencies for nonCpGti (51-fold on average; two-tailed p-value < $10^{-3}$; see Methods for details) and nonCpGtv (25-fold on average, two-tailed p-value < $10^{-3}$), to a lesser extent for CpGtv (p-value = 0.05), but not for CpGti (p-value = 0.16). Intriguingly, the discrepancy between observed and expected frequencies becomes smaller as the mutation rate increases (Fig 2).

Two additional factors should further decrease the expected frequencies relative to our predictions, and will thus exacerbate the discrepancy observed between empirical and expected distribution of deleterious alleles. First, we have ignored the existence of compound heterozygosity, the case in which combinations of two distinct pathogenic alleles in the same gene lead to lethality. We know that this phenomenon is common [36], and indeed 58.4% of the 417 disease mutations considered in this study were initially identified in compound heterozygotes. Because of compound heterozygosity, each deleterious mutation will be selected against not only when present in two copies within the same individual, but also in the presence of certain lethal mutations at other sites of the same gene. Thus, the frequency of a deleterious mutation will be lower under a model incorporating compound heterozygosity than under a model that does not include this phenomenon.
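The direction of this effect can be seen in a deliberately simplified deterministic sketch, which is not part of the simulations: assume an infinite population, s = 1, h = 0, random mating, and that every pair of lethal alleles in a gene is lethal in combination. Here $q_i$ and $u_i$ are the frequency and mutation rate at site i, and Q and U (symbols used for this illustration) are the total frequency and the summed mutation rate of lethal alleles across the gene.

```latex
% Balance at site i: mutational input equals removal in lethal genotypes,
% where an allele at site i is removed whenever it pairs with any lethal allele
u_i \approx q_i\,Q, \qquad Q = \sum_i q_i, \qquad U = \sum_i u_i

% Summing over sites gives the gene-level balance
U \approx Q^2 \;\Rightarrow\; Q \approx \sqrt{U}

% Per-site frequency with compound heterozygosity, compared with eq. (1) at s = 1
q_i \approx \frac{u_i}{\sqrt{U}} \;\le\; \sqrt{u_i}
```

Under these assumptions the per-site frequency falls as the number of lethal sites in the gene grows, since U increases while $u_i$ is fixed.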
In order to model the effect of compound heterozygosity in our simulations, we reran our simulations considering a gene rather than a site. In these simulations, we used the same setup as in the site level analysis, except for the mutation rate, here defined as U, the sum of the mutation rates $u_i$ at each site i that is known to cause a severe and early onset form of the disease (Table S2; see Methods for details). This approach does not consider the contribution of other mutations in the genes that cause the mild and/or late onset forms of the disease, and implicitly assumes that all combinations of known recessive lethal alleles of the same gene have the same fitness effect as homozygotes. Comparing observed frequencies to those generated by simulation, one third of the genes differ at the 5% level, with a clear trend for empirical frequencies to be above expectation (Table S4; Fig 3; Fisher’s combined probability test p-value < $10^{-14}$).

This finding is even more surprising than it may seem, because we do not know the complete mutation target for each gene and are therefore ignoring the contribution of additional sites at which disease mutations could arise. If there are undiscovered recessive lethal mutations that cause death when forming compound heterozygotes with the ascertained ones, the purging effect of purifying selection on the known mutations will be under-estimated, leading us to over-estimate the expected frequencies in simulations. Therefore, our predictions are, if anything, an over-estimate of the expected allele frequency and the discrepancy between predicted and observed is likely even larger than what we found.

The other factor that we did not consider in simulations but would reduce the expected allele frequencies is a subtle fitness decrease in heterozygotes.
To evaluate potential fitness effects in heterozygotes when none had been documented in humans, we considered the phenotypic consequences of orthologous gene knockouts in mouse. We were able to retrieve information on phenotypes for both homozygote and heterozygote mice for only eight out of the 32 genes, namely ASS1, CFTR, DHCR7, NPC1, POLG, PRF1, SLC22A5, and SMPD1. For all eight, homozygote knockout mice presented similar phenotypes as affected humans, and heterozygotes showed a milder but detectable phenotype (Table S5). The extent to which the same phenomenon applies to the mutations with no clinically reported effects in heterozygous humans is unclear, but the finding with knockout mice makes it plausible that there exists a very small fitness decrease in heterozygotes in humans as well, potentially enough to have a marked impact on the allele frequencies of deleterious mutations, but not enough to have been recognized in clinical investigations. Indeed, even if the fitness effect in heterozygotes were h=1%, a 68% decrease in the mean allele frequency of the disease allele is expected (Fig S3).

Discussion

To investigate the population genetics of human disease, we focused on mutations that cause Mendelian, recessive disorders that lead to early death or completely impaired reproduction. We sought to understand to what extent the frequencies of these mutations fit the expectation based on a simple balance between the input of mutations and the purging by purifying selection, as well as how other mechanisms might affect these frequencies. Many studies implicitly or explicitly compare known disease allele frequencies to expectations from mutation-selection balance.
We tested whether known disease alleles as a class fit these expectations, and found that, under a sensible demographic model for European population history with purifying selection only in homozygotes, the expectations fit the observed disease allele frequencies well when the mutation rate is high (see CpGti in Fig 2). If the mutation rate is low, however, as is the case for CpGtv, nonCpGti and nonCpGtv, the mean empirical frequencies of disease alleles are above expectation (Fig 2). Further, including possible effects of the disease in heterozygote carriers or the effect of compound heterozygotes in these models only increases the discrepancy.

In principle, higher than expected disease allele frequencies could be explained by at least four (non-mutually exclusive) factors: (i) misspecification of the demographic model, (ii) widespread errors in reporting the causal variant, (iii) overdominance of disease alleles and (iv) low penetrance of disease mutations. Notably, it has been estimated that population growth in Europe could have been stronger than we considered in our simulations [37,38]. Stronger population growth does increase the expected frequency of recessive disease alleles (Fig S4, columns A-D). However, the impact of more intense growth is likely insufficient to explain the observed discrepancy: the allele frequencies observed in ExAC are still on average an order of magnitude larger than expected based on a model with ten-fold more intense growth than in [21] (Fig S4). Similarly, while errors in reporting the causal variants are known to be common [19,39,40], we attempted to minimize them by filtering out any case for which there was not compelling evidence of association with a recessive lethal disease, reducing our initial set of 769 mutations to 385 in which we had greater confidence (see Methods for details).
The other two factors, overdominance and low penetrance, are likely explanations for a subset of cases. For instance, CFTR, the gene in which mutations lead to cystic fibrosis, is the furthest above expectation (p-value = 0.002; Fig 3). It was long noted that there is an unusually high frequency of the CFTR deletion ΔF508 in Europeans, which led to speculation that disease alleles of this gene may be subject to over-dominance ([41], but see [42]). Whether or not this is the case, there is evidence that disease mutations in this gene can complement one another [8,9] and that modifier loci in other genes also influence their penetrance [9,12]. Consistent with variable penetrance, Chen et al. [20] identified three purportedly healthy individuals carrying two copies of disease mutations in this gene. Similarly, DHCR7, the gene associated with the Smith-Lemli-Opitz syndrome, is somewhat above expectation in our analysis (p-value = 0.052; Fig 3) and healthy individuals were found to be homozygous carriers of putatively lethal disease alleles in other studies [20]. These observations make it plausible that, in a subset of cases (particularly for CFTR), the high frequency of deleterious mutations associated with recessive, lethal diseases is due to genetic interactions that modify the penetrance of certain recessive disease mutations.

Nonetheless, two factors argue against modifier alleles being a sufficient explanation for the site-level analysis.
First, when we remove 130 mutations in CFTR and 12 in DHCR7, two genes that were outliers at the gene-level (Fig 3; Table S4) and for which healthy individuals were reported to be homozygous for a deleterious allele [20], the discrepancy between observed and expected allele frequencies is barely impacted (Fig S5). Second, there is no obvious reason why this explanation would apply differently to distinct types of mutations, yet the extent to which observed allele frequencies exceed the expected depends on the mutation rate, with no departure seen at CpGti (Fig 2).

Instead, it seems plausible that there is an ascertainment bias in disease allele discovery and mutation identification. Since not all mutations that can cause a specific Mendelian disease are known, those mutations that were discovered to date are likely the ones that by chance have drifted to higher frequencies, and are thus more likely to have been mapped and be reported in the medical literature. One clear implication is that there are numerous sites at which mutations cause recessive lethal diseases yet to be discovered, particularly at non-CpG sites. More generally, this ascertainment bias complicates the interpretation of observed allele frequencies in terms of the selection pressures acting on disease alleles.

Beyond this specific point, our study illustrates how the large sample sizes now made available to researchers in the context of projects like ExAC [19] can be used not only for direct discovery of disease variants, but also to test why there are disease alleles segregating in the population and to understand at what frequencies we might expect to find them.
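The ascertainment argument made in this Discussion can be illustrated with a toy calculation. This is a sketch under strong assumptions that are not taken from the paper: the chance that a disease allele is discovered is taken to be proportional to its population frequency, and the underlying frequencies are drawn from an arbitrary exponential distribution with mean 1e-5. Under frequency-proportional discovery, the expected frequency of a discovered allele is the size-biased mean, E[q^2]/E[q], which is never smaller than the plain mean.

```python
import random


def mean(values):
    return sum(values) / len(values)


def size_biased_mean(freqs):
    """Expected frequency of a discovered allele when the probability of
    discovery is proportional to frequency (an assumption of this sketch)."""
    return sum(q * q for q in freqs) / sum(freqs)


rng = random.Random(0)
# Hypothetical frequency distribution: exponential with mean 1e-5.
freqs = [rng.expovariate(1e5) for _ in range(10_000)]

print("mean over all alleles:       ", mean(freqs))
print("mean over discovered alleles:", size_biased_mean(freqs))
```

The gap between the two means grows with the variance of the frequency distribution, so classes of mutations with more dispersed frequencies would be expected to show a larger excess among ascertained alleles in this toy model.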
Methods

Disease allele set

In order to identify single nucleotide variants within the 42 genes associated with lethal, recessive Mendelian diseases (Table S1), we initially relied on the ClinVar dataset [43] (accessed on June 3rd, 2015). We filtered out any variant that is an indel or a more complex copy number variant or that is ever classified as benign or likely benign in ClinVar (whether or not it is also classified as pathogenic or likely pathogenic). By this approach, we obtained 769 SNVs described as pathogenic or likely pathogenic. For each one of these variants, we searched the literature for evidence that it is exclusively associated with the lethal and early onset form of the disease and was never reported as causing the mild and/or late-onset form of the disease. We considered effects in the absence of medical treatment, as we were interested in the selection pressures acting on the alleles over evolutionary scales rather than in the last one or two generations. Variants with mention of incomplete penetrance (i.e. for which homozygotes were not always affected) or with known effects in heterozygote carriers were removed from the analysis. This process yielded 417 SNVs in 32 genes associated with distinct Mendelian recessive lethal disorders (Table S2). Although these mutations were purportedly associated with completely recessive diseases, we sought to examine whether there would be possible, unreported effects in heterozygous carriers. To this end, we used the Mouse Genome Database (MGD) [44] (accessed July 29th, 2015) and were able to retrieve information for both homozygote and heterozygote mice for eight out of the 32 genes (all of which have a homologue in mice) (Table S5).
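The automated part of the ClinVar filtering described in this section (before the manual literature review) can be sketched as a simple predicate. The record layout below is hypothetical and does not reflect ClinVar's actual schema or file format: each variant is assumed to be a dictionary with a `variant_type` string and a `classifications` list of clinical significance labels.

```python
# Hypothetical record layout; field names are NOT ClinVar's actual schema.
BENIGN = {"benign", "likely benign"}
PATHOGENIC = {"pathogenic", "likely pathogenic"}


def keep_variant(record):
    """Return True if a variant passes the automated filters described in the text."""
    labels = {label.lower() for label in record["classifications"]}
    if record["variant_type"] != "SNV":
        return False  # drop indels and more complex copy number variants
    if labels & BENIGN:
        return False  # drop if ever classified benign or likely benign
    return bool(labels & PATHOGENIC)  # keep pathogenic or likely pathogenic


# Made-up example records for illustration only.
variants = [
    {"variant_type": "SNV", "classifications": ["Pathogenic"]},
    {"variant_type": "SNV", "classifications": ["Pathogenic", "Likely benign"]},
    {"variant_type": "indel", "classifications": ["Pathogenic"]},
]
kept = [v for v in variants if keep_variant(v)]
print(len(kept), "of", len(variants), "example records kept")
```

The subsequent steps in the text (literature review for severity, penetrance, and heterozygote effects) are manual and are not represented here.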
In addition to the information provided by ClinVar for each one of these variants, we considered the immediate sequence context of each SNV, to tailor the mutation rate estimate accordingly [16]. To do so, we used an in-house Python script and the human genome reference sequence hg19 from UCSC (<http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/>).

Genetic datasets

The Exome Aggregation Consortium (ExAC) [19] was accessed on August 9th, 2016. The data consist of genotype frequencies for 60,706 individuals, assigned after PCA analysis to one of seven population labels: African (N=10,406), East Asian (N=8,654), Finnish (N=6,614), Latino (N=11,578), Non-Finnish European (N=66,740), South Asian (N=16,512) and “other” (N=908). Here N denotes the number of chromosomes: the seven counts sum to 121,412, twice the number of individuals. We focused our analyses on individuals of Non-Finnish European descent, because they constitute the largest sample from a single ancestry group. We note that some disease mutations, for instance those in ASPA, HEXA and SMPD1, are known to be especially prevalent in Ashkenazi Jewish populations, which could potentially bias our results if Ashkenazi Jewish individuals constituted a large portion of the sample we considered. However, this sample includes only ~2,000 (~3%) Ashkenazi individuals (Dr. Daniel MacArthur, personal communication).

From the initial 417 mutations, we filtered out three that were homozygous in at least one individual in ExAC and 29 that had lower coverage, i.e., fewer than 80% of the individuals were sequenced to at least 15x. This approach left us with a set of 385 mutations with a minimum coverage of 27x per sample and an average coverage of 69x per sample (Table S2).
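As a rough illustration of the sequence-context step mentioned in Methods (assigning each SNV to one of the four mutation types, so that a type-specific mutation rate from [16] can be used), the following Python sketch classifies a substitution from its two flanking reference bases. It is not the in-house script used in this study: the function name, the argument order, and the choice to treat a G preceded by a C as a CpG site on the opposite strand are assumptions made for illustration.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

# Mean per-site, per-generation rates quoted in Methods from [16].
MEAN_RATES = {
    "CpGti": 1.12e-7,
    "CpGtv": 9.59e-9,
    "nonCpGti": 6.18e-9,
    "nonCpGtv": 3.76e-9,
}


def mutation_type(prev_base: str, ref: str, next_base: str, alt: str) -> str:
    """Classify an SNV as CpGti, CpGtv, nonCpGti or nonCpGtv.

    prev_base and next_base are the reference bases immediately 5' and 3'
    of the mutated site, all given on the same strand as ref and alt.
    """
    prev_base, ref, next_base, alt = (b.upper() for b in (prev_base, ref, next_base, alt))
    if any(b not in "ACGT" or len(b) != 1 for b in (prev_base, ref, next_base, alt)):
        raise ValueError("bases must be single characters from A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt must differ")
    # A C followed by G is a CpG site; a G preceded by C is the same
    # dinucleotide seen from the opposite strand (illustrative assumption).
    is_cpg = (ref == "C" and next_base == "G") or (ref == "G" and prev_base == "C")
    is_transition = (ref, alt) in TRANSITIONS
    return ("CpG" if is_cpg else "nonCpG") + ("ti" if is_transition else "tv")


example_type = mutation_type("A", "C", "G", "T")  # C>T in an ACG context
example_rate = MEAN_RATES[example_type]
print(example_type, example_rate)
```

Keeping the classification separate from the rate table makes it straightforward to substitute a different set of rate estimates without touching the context logic.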
For 248 sites with non-zero sample frequencies, ExAC reported the number of non-Finnish European individuals that were sequenced, which was on average 32,881 individuals [19]. For the remaining 137 sites, we did not have this information. Nonetheless, the mean coverage is reported for all sites and does not differ between the two sets of sites (Fig S6). We therefore assumed that the mean number of individuals covered for all sites was 32,881 [45] and used this number to obtain sample frequencies from simulations, as explained below.

A second genetic dataset was obtained from Counsyl (<https://www.counsyl.com/>). Counsyl is a commercial genetic screening laboratory that offers, among other products, the “Family Prep Screen”, a genetic screening test intended to detect carrier status for up to 110 recessive Mendelian diseases in couples that are planning to have a child. A subset of 294,000 of its customers was surveyed by genotyping or sequencing for “routine carrier screening”. This subset excludes individuals with indications for testing because of known personal or family history of Mendelian diseases, infertility, or consanguinity. It therefore represents a more random (with regard to the presence of disease alleles), population-based survey. For these individuals, we had details on self-reported ancestry (14 distinct ethnic/ancestry/geographic groups) and the allele frequencies for 98 mutations that match those that passed our variant selection criteria described above, of which 92 are also sequenced to high coverage in the ExAC database (Table S2). We focused our analysis of this dataset on the 76,314 individuals with self-reported Northern or Southern European ancestry.
Simulating the evolution of disease alleles with population size change

We also modeled the frequency of a deleterious allele in human populations by forward simulations based on a crude but plausible demographic model for human populations from Africa and Europe, inferred from exome data for African-Americans and European-Americans [21]. To this end, we used a program described in [1]. In brief, the demographic scenario consists of an Out-of-Africa demographic model, with changes in population size throughout the population history, including a severe bottleneck in Europeans following the split from the African population and a rapid, recent population growth in both populations [21]. As in Simons et al. [1], we simulated genetic drift and two-way gene flow between Africans and Europeans in recent history. Negative selection acting on a single bi-allelic site was modeled as in the analytic models.

Allele frequencies follow a Wright-Fisher sampling scheme in each generation according to these viabilities, with migration rate and population sizes varying according to the demographic scenario considered. Whenever a demographic event (e.g., growth) altered the number of individuals and the resulting number was not an integer, we rounded it to the nearest integer, as in Simons et al. [1]. A burn-in period of 10Ne generations with constant population size Ne = 7,310 individuals was implemented in order to ensure an equilibrium distribution of segregating alleles at the onset of demographic changes in Africa, 148 Kya.

In contrast to Simons et al. [1], our simulations always start with the ancestral allele A fixed and mutation occurs exclusively from this allele to the deleterious one (a), i.e., a mutation occurs with mean probability u per gamete, per generation, and there is no back-mutation. However, recurrent mutations at a site are allowed, as in Simons et al. [1].
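To make the simulation scheme above more concrete, here is a deliberately simplified Python sketch of a single-site Wright-Fisher model with one-way mutation and viability selection. It is not the program of Simons et al. [1]: it assumes a single population of constant size, no migration, genotype viabilities of 1, 1 - hs and 1 - s, and mutation applied before selection within each generation. The function names and toy parameter values are hypothetical.

```python
import random


def next_expected_frequency(q: float, u: float, h: float = 0.0, s: float = 1.0) -> float:
    """Expected frequency of allele a after one round of mutation and selection."""
    q = q + u * (1.0 - q)  # one-way mutation A -> a, no back-mutation
    p = 1.0 - q
    # Mean viability with genotype viabilities AA: 1, Aa: 1 - h*s, aa: 1 - s.
    w_bar = 1.0 - 2.0 * p * q * h * s - q * q * s
    if w_bar <= 0.0:
        raise ValueError("mean viability is zero; no survivors in this model")
    return (p * q * (1.0 - h * s) + q * q * (1.0 - s)) / w_bar


def simulate(n_individuals: int, u: float, generations: int,
             h: float = 0.0, s: float = 1.0, seed: int = 1) -> list:
    """Trajectory of allele a, starting with the ancestral allele A fixed."""
    rng = random.Random(seed)
    n_gametes = 2 * n_individuals
    q = 0.0
    trajectory = [q]
    for _ in range(generations):
        q_expected = next_expected_frequency(q, u, h, s)
        # Wright-Fisher drift: binomial sampling of 2N gametes.
        count = sum(1 for _ in range(n_gametes) if rng.random() < q_expected)
        q = count / n_gametes
        trajectory.append(q)
    return trajectory


# Toy values chosen only so that the example runs quickly.
trajectory = simulate(n_individuals=200, u=1e-3, generations=100)
print(max(trajectory))
```

For a fully recessive lethal (h = 0, s = 1) and no new mutation, the deterministic step reduces to q / (1 + q), which gives a quick sanity check on `next_expected_frequency`.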
When implementing the model, we used mean mutation rates u estimated from a large human pedigree study [16], considering the genomic average (1.2 × 10^-8 per base pair, per generation) and four distinct mutation types (CpGti = 1.12 × 10^-7; CpGtv = 9.59 × 10^-9; nonCpGti = 6.18 × 10^-9; and nonCpGtv = 3.76 × 10^-9). While these four categories capture much of the variation in germline mutation rates across sites, a number of other factors (e.g., the larger sequence context or the replication timing) also influence mutation rates, introducing heterogeneity in the mutation rate within each class considered [31-33,46]. To allow for this heterogeneity as well as for uncertainty in the point estimates of the mutation rates, in each simulation, instead of using a fixed rate u for each mutation type, we drew the mutation rate M from a lognormal distribution with the following parameters:

$$\log_{10} M \mid u \sim \mathcal{N}\left(\log_{10} u - \frac{\sigma^{2}\ln(10)}{2},\ \sigma^{2}\right) \qquad (7)$$

such that E[M] = u. σ was set to 0.57 (following [33]).

By this procedure, we ran two million simulations for each mutation type, thus obtaining the distribution of deleterious allele frequencies expected for the European population. In order to compare simulation results to the empirical data, we subsampled the simulated population to match the average number of autosomal chromosomes in the non-Finnish European sample from ExAC (N = 65,762 chromosomes).

To measure the significance of the deviation between observed and expected allele frequencies, we proceeded as follows. First, we sampled K allele frequencies from the two million simulations implemented for each mutation type, where K is the number of mutations described for that type. We repeated this step 1,000 times, thus obtaining a distribution for the mean allele frequency across K mutations. Finally, to obtain a two-tailed p-value, we considered the rank of the empirical mean relative to simulated outcomes.
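The mutation rate draw in equation (7) and the rank-based test described above can be sketched in Python as follows. Two points are assumptions rather than details given in the text: the K frequencies are resampled with replacement, and the two-tailed p-value is computed as twice the smaller of the two tail proportions, capped at 1. All function names are hypothetical.

```python
import math
import random

SIGMA = 0.57  # standard deviation on the log10 scale, as stated in the text


def draw_mutation_rate(u: float, rng: random.Random, sigma: float = SIGMA) -> float:
    """Draw M with log10 M ~ N(log10 u - sigma^2 ln(10) / 2, sigma^2), so E[M] = u."""
    mu = math.log10(u) - sigma ** 2 * math.log(10) / 2.0
    return 10.0 ** rng.gauss(mu, sigma)


def resampled_means(simulated_freqs: list, k: int, n_rep: int, rng: random.Random) -> list:
    """Mean of k simulated frequencies, drawn with replacement, repeated n_rep times."""
    return [sum(rng.choice(simulated_freqs) for _ in range(k)) / k for _ in range(n_rep)]


def two_tailed_rank_p(observed_mean: float, null_means: list) -> float:
    """Two-tailed p-value from the rank of the observed mean among simulated means."""
    n = len(null_means)
    lower = sum(m <= observed_mean for m in null_means) / n
    upper = sum(m >= observed_mean for m in null_means) / n
    return min(1.0, 2.0 * min(lower, upper))


rng = random.Random(42)
u = 1.12e-7  # CpGti rate quoted in the text
draws = [draw_mutation_rate(u, rng) for _ in range(200_000)]
mean_ratio = sum(draws) / len(draws) / u  # should be close to 1 if E[M] = u
print(round(mean_ratio, 2))
```

The shift of the mean by σ² ln(10) / 2 on the log10 scale is what keeps the expectation of M equal to u. Without it, the lognormal draw would inflate the average mutation rate.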
A well-known source of heterogeneity in mutation rate within the CpGti class is methylation status, with a high transition rate seen only at methylated CpGs [18]. In our analyses, we tried to control for the methylation status of CpG sites by excluding sites located in CpG islands (CGIs). The CGI annotation for hg19 was obtained from the UCSC Genome Browser (track “Unmasked CpG”; <http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/cpgIslandExtUnmasked.txt.gz>, accessed on June 6th, 2016) and BEDTools [47] was used to exclude those CpG sites located in CGIs. We note that the CpGti estimate from [16] includes CGIs, and in that sense the average mutation rate that we are using for CpGti may be a very slight underestimate of the mean rate for transitions at methylated CpG sites.

Unless otherwise noted, the expectation assumes fully recessive, lethal alleles with complete penetrance. Notably, by calculating the expected frequency one site at a time, we are ignoring possible interactions between genes (i.e., effects of the genetic background) and among different mutations within a gene (i.e., compound heterozygotes). These assumptions are relaxed in two ways. First, in one analysis (Fig S3), we consider a very small selective effect in heterozygous individuals (h = 1%), reasoning that such an effect could plausibly go undetected in medical examinations and yet would nonetheless impact the frequency of the disease allele. Second, when considering the gene-level analysis (Fig 3), we implicitly allow for compound heterozygosity between any pair of lethal mutations. For this analysis, we ran 1000 simulations for a total mutation rate U per gene that was calculated accounting for the heterogeneity and uncertainty in the mutation rate estimates as follows: (i) for sites known to cause a recessive lethal disease and that passed our filtering criteria (Table S2), we drew a mutation rate u_i from the lognormal distribution, as described above; (ii) we then took the sum of the u_i as the total mutation rate U; (iii) we then ran one replicate with U as the mutation parameter, and other parameters as specified for the site-level analysis. Because the mutational target size considered in simulations comprises only those sites at which mutations are known to cause a lethal recessive disease, U is almost certainly an underestimate of the true mutation rate, potentially by a large margin. We note further that by this approach, we are assuming that compound heterozygotes formed by any two lethal alleles have fitness zero, i.e., that they are identical in their effects to homozygotes for any of the lethal alleles. Moreover, we are implicitly ignoring the possibility of complementation, which is (somewhat) justified by our focus on mutations with severe effects and complete penetrance (but see Discussion). Since we were interested in understanding the effect of compound heterozygosity, for this analysis, we did not consider the five genes in which only one mutation passed our filters (BCS1L, FKTN, LAMA3, PLA3G6, and TCIRG1).

All code and data to generate the figures in R [48] and the script used to get the sequence context of each mutation (kindly provided by Ellen Leffler) are available at https://github.com/cegamorim/PopGenHumDisease. The code to run the simulations is available at https://github.com/sellalab/PopGenHumDisease. Allele frequencies and other information for the disease mutations employed in the analyses are in Tables S2 and S3.

Acknowledgements

We thank Dr.
Daniel MacArthur for providing information on accessing ExAC data. CEGA was partially funded by a Science Without Borders fellowship from CAPES foundation (BEX8279/11-0) and CNPq (PDE201145/2015-4), Brazil. ZG was partially supported by a postdoctoral fellowship funded by Stanford Center for Computational, Evolutionary and Human Genomics. JFD was funded by a Science Without Borders fellowship from CAPES foundation (88888.038761/2013-00). The work was partially supported by a Research Initiative in Science and Engineering grant from Columbia University to JKP and MP. The computing in this project was supported by two National Institutes of Health instrumentation grants (S10OD012351 and S10OD021764) received by the Department of Systems Biology at Columbia University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author Contributions

Conceived the study: JP, MP. Designed the study: CEGA, ZG, MP. Analyzed the data: CEGA. Implemented analytical models: ZG. Wrote the paper: CEGA, ZG, MP. Helped in acquisition and analysis of data: ZB, JFD. Contributed analytical tools or data: YBS, IR.

References

1. Simons YB, Turchin MC, Pritchard JK, Sella G (2014) The deleterious mutation load is insensitive to recent population history. Nat Genet 46: 220-224.
2. Simons YB, Sella G (2016) The impact of recent population history on the deleterious mutation load in humans and close evolutionary relatives. bioRxiv.
3. Nei M (1968) The frequency distribution of lethal chromosomes in finite populations. Proc Natl Acad Sci U S A 60: 517-524.
4. Gillespie JH (2004) Population Genetics: A Concise Guide. Baltimore, MD: Johns Hopkins University Press.
5. Brandvain Y, Wright SI (2016) The Limits of Natural Selection in a Nonequilibrium World. Trends Genet 32: 201-210.
6. Balick DJ, Do R, Cassa CA, Reich D, Sunyaev SR (2015) Dominance of Deleterious Alleles Controls the Response to a Population Bottleneck.
7. Beauchamp KA, Muzzey D, Wong KK, Hogan GJ, Karimi K, et al. (2016) Systematic Design and Comparison of Expanded Carrier Screening Panels. bioRxiv.
8. Cormet-Boyaka E, Jablonsky M, Naren AP, Jackson PL, Muccio DD, et al. (2004) Rescuing cystic fibrosis transmembrane conductance regulator (CFTR)-processing mutants by transcomplementation. Proc Natl Acad Sci U S A 101: 8221-8226.
9. Rapino D, Sabirzhanova I, Lopes-Pacheco M, Grover R, Guggino WB, et al. (2015) Rescue of NBD2 mutants N1303K and S1235R of CFTR by small-molecule correctors and transcomplementation. PLoS One 10: e0119796.
10. Andressoo JO, Jans J, de Wit J, Coin F, Hoogstraten D, et al. (2006) Rescue of progeria in trichothiodystrophy by homozygous lethal Xpd alleles. PLoS Biol 4: e322.
11. Gallati S (2014) Disease-modifying genes and monogenic disorders: experience in cystic fibrosis. Appl Clin Genet 7: 133-146.
12. Corvol H, Blackman SM, Boelle PY, Gallins PJ, Pace RG, et al. (2015) Genome-wide association meta-analysis identifies five modifier loci of lung disease severity in cystic fibrosis. Nat Commun 6: 8382.
13. Habara A, Steinberg MH (2016) Minireview: Genetic basis of heterogeneity and severity in sickle cell disease. Exp Biol Med (Maywood) 241: 689-696.
14. Hedrick PW (2011) Population genetics of malaria resistance in humans. Heredity (Edinb) 107: 283-304.
15. Chong JX, Buckingham KJ, Jhangiani SN, Boehm C, Sobreira N, et al. (2015) The Genetic Basis of Mendelian Phenotypes: Discoveries, Challenges, and Opportunities. Am J Hum Genet 97: 199-215.
16. Kong A, Frigge ML, Masson G, Besenbacher S, Sulem P, et al. (2012) Rate of de novo mutations and the importance of father's age to disease risk. Nature 488: 471-475.
17. Turner TN, Douville C, Kim D, Stenson PD, Cooper DN, et al. (2015) Proteins linked to autosomal dominant and autosomal recessive disorders harbor characteristic rare missense mutation distribution patterns. Hum Mol Genet.
18. Cooper DN, Youssoufian H (1988) The CpG dinucleotide and human genetic disease. Hum Genet 78: 151-155.
19. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, et al. (2016) Analysis of protein-coding genetic variation in 60,706 humans. Nature 536: 285-291.
20. Chen R, Shi L, Hakenberg J, Naughton B, Sklar P, et al. (2016) Analysis of 589,306 genomes identifies individuals resilient to severe Mendelian childhood diseases. Nat Biotechnol 34: 531-538.
21. Tennessen JA, Bigham AW, O'Connor TD, Fu W, Kenny EE, et al. (2012) Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337: 64-69.
22. Wright S (1937) The Distribution of Gene Frequencies in Populations. Proc Natl Acad Sci U S A 23: 307-320.
23. Reich DE, Lander ES (2001) On the allelic spectrum of human disease. Trends Genet 17: 502-510.
24. Pritchard JK, Cox NJ (2002) The allelic architecture of human disease genes: common disease-common variant...or not? Hum Mol Genet 11: 2417-2423.
25. Zwick ME, Cutler DJ, Chakravarti A (2000) Patterns of genetic variation in Mendelian and complex traits. Annu Rev Genomics Hum Genet 1: 387-407.
26. Cassa CA, Weghorn D, Balick DJ, Jordan DM, Nusinow D, et al. (2016) Estimating the Selective Effect of Heterozygous Protein Truncating Variants from Human Exome Data. bioRxiv.
27. The 1000 Genomes Project Consortium (2015) A global reference for human genetic variation. Nature 526: 68-74.
28. Lohmueller KE, Indap AR, Schmidt S, Boyko AR, Hernandez RD, et al. (2008) Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994-997.
29. Peischl S, Dupanloup I, Bosshard L, Excoffier L (2016) Genetic surfing in human populations: from genes to genomes. Curr Opin Genet Dev 41: 53-61.
30. Pritchard JK, Cox NJ (2002) The allelic architecture of human disease genes: common disease-common variant...or not?
31. Segurel L, Wyman MJ, Przeworski M (2014) Determinants of mutation rate variation in the human germline. Annu Rev Genomics Hum Genet 15: 47-70.
32. Aggarwala V, Voight BF (2016) An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nat Genet 48: 349-355.
33. Harpak A, Bhaskar A, Pritchard JK (2016) Effects of variable mutation rates and epistasis on the distribution of allele frequencies in humans. bioRxiv.
34. Hodgkinson A, Eyre-Walker A. Variation in the mutation rate across mammalian genomes.
35. Moorjani P, Amorim CE, Arndt PF, Przeworski M (2016) Variation in the molecular clock of primates. Proc Natl Acad Sci U S A 113: 10607-10612.
36. Kamphans T, Sabri P, Zhu N, Heinrich V, Mundlos S, et al. (2013) Filtering for compound heterozygous sequence variants in non-consanguineous pedigrees. PLoS One 8: e70151.
37. Gazave E, Ma L, Chang D, Coventry A, Gao F, et al. (2014) Neutral genomic regions refine models of recent rapid human population growth. Proc Natl Acad Sci U S A 111: 757-762.
38. Gao F, Keinan A (2016) Explosive genetic evidence for explosive human population growth. Curr Opin Genet Dev 41: 130-139.
39. Xue Y, Chen Y, Ayub Q, Huang N, Ball EV, et al. (2012) Deleterious- and disease-allele prevalence in healthy individuals: insights from current predictions, mutation databases, and population-scale resequencing. Am J Hum Genet 91: 1022-1032.
40. Piton A, Redin C, Mandel JL (2013) XLID-causing mutations and associated genes challenged in light of data from large-scale human exome sequencing. Am J Hum Genet 93: 368-383.
41. Gabriel SE, Brigman KN, Koller BH, Boucher RC, Stutts MJ (1994) Cystic fibrosis heterozygote resistance to cholera toxin in the cystic fibrosis mouse model. Science 266: 107-109.
42. Quinton PM (1994) Human genetics. What is good about cystic fibrosis? Curr Biol 4: 742-743.
43. Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, et al. (2014) ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 42: D980-D985.
44. Eppig JT, Blake JA, Bult CJ, Kadin JA, Richardson JE (2015) The Mouse Genome Database (MGD): facilitating mouse as a model for human biology and disease. Nucleic Acids Res 43: D726-736.
45. Haque IS, Lazarin GA, Kang H, Evans EA, Goldberg JD, et al. (2016) Modeled fetal risk of genetic diseases identified by expanded carrier screening. JAMA 316: 734-742.
46. Mugal CF, Ellegren H (2011) Substitution rate variation at human CpG sites correlates with non-CpG divergence, methylation level and GC content. Genome Biol 12: R58.
47. Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26: 841-842.
48. R Core Team (2015) R: A Language and Environment for Statistical Computing. Vienna, Austria.
49. Online Mendelian Inheritance in Man, OMIM (2016). Baltimore, MD: McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University.

[Figure 1: histogram; X-axis: Log10(Allele frequency); Y-axis: Density.]

Fig 1. Expected allele frequencies in the population under mutation-selection balance models. The blue bar denotes the expectation under an infinite population size, the green bar under a finite constant population size, and the red bar under a plausible demographic model for European populations (distribution shown in the grey histogram). All models assume s=1 and h=0. For the finite constant population size model, we present the mean frequency expected for a population size of 20,000 (see Fig S2a for the effect of varying the population size). Sample allele frequencies (q) were transformed to log10(q) and those with q=0 were set to 10^-7 for visual purposes, but indicated as “0” on the X-axis. The Y-axis is plotted on a log scale.
[Figure 2: four panels of violin plots comparing SIM and ExAC; Y-axis: Log10(Frequency). Panel titles: CpGti (n=101), p-value=0.16; CpGtv (n=13), p-value=0.05; nonCpGti (n=155), p-value<1e-3; nonCpGtv (n=104), p-value<1e-3.]

Fig 2. Expected and observed distributions of disease allele frequencies. The four panels correspond to four different mutation types. The title of each panel indicates the mutation type, followed by the total number of mutations, with the p-value for the difference between observed and expected mean frequencies below it. Results in grey (SIM) were obtained from simulations considering a plausible demographic model for European populations [21] (see text). The observed values estimated from 33,370 individuals of European ancestry from ExAC are shown in white. Violin plots show the density distribution of variable sites, while boxes indicate the proportion of sites for which the wild-type, non-deleterious allele is fixed. Means (considering both segregating and fixed sites) are indicated with red horizontal bars.

[Figure 3: per-gene violin plots; X-axis: genes; Y-axis: Log10(Frequency).]

Fig 3. Disease allele frequencies at the gene level. The expectation (grey) is based on 1000 simulations, assuming no effect in heterozygotes, but allowing for compound heterozygosity (see Methods for details). Observed frequencies (purple bars) were obtained from ExAC considering 33,370 European individuals. Genes are ordered according to the two-tailed p-value for the significance of the deviation of the empirical frequencies from the expected (Table S4). To calculate this p-value, we considered the rank of the empirical mean relative to the simulations. Violin plots show the distribution of simulated allele frequencies among segregating alleles and boxes represent the fraction of simulations in which no deleterious allele was observed in the simulated sample at present.

Table S5. Phenotypic effect of mouse knock-outs (see main text)

Gene: ASS1
Human disease: Citrullinemia
OMIM number: 215700
Phenotype of affected human cases (a): Very high concentration of the amino acid citrulline in serum, spinal fluid, and urine.
Phenotype of homozygous knockout mice (b): Complete neonatal lethality, abnormal circulating amino-acid level, increased circulating ammonia level.
Phenotype of heterozygous knockout mice (b): Abnormal circulating amino-acid level.

Gene: CFTR
Human disease: Cystic fibrosis
OMIM number: 219700
Phenotype of affected human cases (a): Disruption of exocrine function of the pancreas, intestinal glands (meconium ileus), biliary tree (biliary cirrhosis), bronchial glands (chronic bronchopulmonary infection with emphysema) and sweat glands (high sweat electrolyte with depletion in a hot environment). Infertility occurs in males and females.
Phenotype of homozygous knockout mice (b): Partial postnatal lethality, aphagia, pancreatic acinar cell atrophy, abnormal intestine morphology, abnormal digestive system physiology, abnormal gland morphology, acute pancreas inflammation, weight loss, distended abdomen, abnormal ion homeostasis, enlarged gallbladder, abnormal respiratory system physiology, lacrimal gland atrophy.
Phenotype of heterozygous knockout mice (b): Impaired fertilization, decreased litter size.

Gene: DHCR7
Human disease: Smith-Lemli-Opitz syndrome
OMIM number: 270400
Phenotype of affected human cases (a): Multiple congenital malformation and mental retardation syndrome.
Phenotype of homozygous knockout mice (b): Complete neonatal lethality, abnormal suckling behavior, weakness, abnormal nasal cavity morphology, fetal growth retardation, cyanosis, abnormal brain development, distended urinary bladder.
Phenotype of heterozygous knockout mice (b): Abnormal cholesterol level, syndactyly, partial embryonic lethality, decreased brain size.

Gene: NPC1
Human disease: Niemann-Pick disease, type C1
OMIM number: 257220
Phenotype of affected human cases (a): Lipid storage disorder characterized by progressive neurodegeneration.
Phenotype of homozygous knockout mice (b): Premature death, abnormal Purkinje cell morphology, increased brain cholesterol level, increased liver cholesterol level, abnormal macrophage morphology, abnormal microglial cell activation, abnormal lipid homeostasis, decreased body weight, impaired coordination.
Phenotype of heterozygous knockout mice (b): Increased brain cholesterol level.

Gene: POLG
Human disease: Alpers syndrome
OMIM number: 203700
Phenotype of affected human cases (a): Clinical triad of psychomotor retardation, intractable epilepsy, and liver failure in infants and young children. Pathologic findings include neuronal loss in the cerebral gray matter with reactive astrocytosis and liver cirrhosis.
Phenotype of homozygous knockout mice (b): Premature death, abnormal mitochondrial physiology, decreased thymocyte number, abnormal lymphopoiesis, macrocytic anemia, abnormal erythroid lineage cell morphology.
Phenotype of heterozygous knockout mice (b): Abnormal bone marrow cell physiology, increased B cell derived lymphoma incidence.

Gene: PRF1
Human disease: Hemophagocytic lymphohistiocytosis
OMIM number: 603553
Phenotype of affected human cases (a): Immune dysregulation characterized clinically by fever, edema, hepatosplenomegaly, and liver dysfunction. Neurologic impairment, seizures, and ataxia are frequent.
Phenotype of homozygous knockout mice (b): Increased activated T cell number, decreased cytotoxic T cell cytolysis, abnormal cytokine secretion, decreased susceptibility to autoimmune diabetes, increased susceptibility to viral infection, premature death, complete postnatal lethality, liver inflammation, CNS inflammation, abnormal circulating cytokine level, decreased leukocyte cell number.
Phenotype of heterozygous knockout mice (b): Insulitis, periinsulitis, impaired natural killer cell mediated cytotoxicity.

Gene: SLC22A5
Human disease: Carnitine deficiency
OMIM number: 212140
Phenotype of affected human cases (a): This results in impaired fatty acid oxidation in skeletal and heart muscle. In addition, renal wasting of carnitine results in low serum levels and diminished hepatic uptake of carnitine by passive diffusion, which impairs ketogenesis.
Phenotype of homozygous knockout mice (b): Premature death, enlarged liver, hepatic steatosis, increased triglyceride level, decreased circulating carnitine level, impaired lipolysis, decreased body weight, enlarged heart.
Phenotype of heterozygous knockout mice (b): Decreased circulating carnitine level, impaired lipolysis.

Gene: SMPD1
Human disease: Niemann-Pick disease, type A
OMIM number: 257200
Phenotype of affected human cases (a): The clinical phenotype for type A ranges from a severe infantile form with neurologic degeneration resulting in death usually by 3 years of age.
Phenotype of homozygous knockout mice (b): Premature death, ataxia, lethargy, abnormal apoptosis, decreased body weight, increased macrophage derived foam cell number, abnormal lipid homeostasis, increased susceptibility to bacterial infection, decreased brain size.
Phenotype of heterozygous knockout mice (b): Abnormal immune system cell morphology, abnormal neuron differentiation, abnormal depression-related behavior.

Phenotypes obtained from [49] (a) and [44] (b).

[Figure S1: violin plots for ExAC and Counsyl; Y-axis: Log10(Frequency).]

Fig S1. Frequency distribution of disease mutations in individuals of European ancestry. Shown are allele frequencies for 92 variants associated with lethal, recessive diseases, as estimated from 33,370 individuals of non-Finnish European ancestry in the Exome Aggregation Consortium (ExAC) database [19] and 76,314 European-ancestry individuals from a genetic testing laboratory (Counsyl) (see Methods). Allele frequencies do not differ significantly between datasets (Wilcoxon signed-rank test for paired samples, p-value=0.34). Violin plots show the density distribution of alleles segregating in these samples, while boxes indicate the proportion of sites for which the deleterious mutation was not observed.
[Figure S2: panel A plots deleterious allele frequency against population size; panel B shows violin plots of Log10(Frequency) for SIM and FIN.]

Fig S2. Comparisons of SIM and FIN models. (A) Mean allele frequency as a function of effective population size, under the FIN model. The X-axis range corresponds to the minimum and maximum effective population size estimated in [21]. The red bar indicates the value of a constant population size at which the mean allele frequency predicted under a constant population size model is the same as the mean allele frequency estimated in simulations (SIM), for an average mutation rate of 1.20 × 10^-8 [16]. For this value, the mean matches the constant size expectation, but the distributions differ (see panel B). (B) SIM refers to the complex demographic scenario inferred by Tennessen et al. [21] for the evolution of European populations (see Methods). In the finite, constant size population model (FIN), N is set to 72,348 individuals, so that the mean allele frequency (red bars) is the same as in simulations (see panel A). To generate a distribution for this model, we simulated a constant population size, as described in Methods for the Tennessen et al. model. Violin plots show the distribution of allele frequencies among segregating alleles, while the boxes indicate the proportion of sites at which the non-disease allele is fixed. These models assume strong selection (s=1) and complete recessivity (h=0). As can be seen, the two distributions differ significantly (Kolmogorov-Smirnov test, p-value < 10^-15).
[Figure S3: violin plots of Log10(Frequency) for h=0 and h=1%.]

Fig S3. The impact on disease allele frequencies of a small fitness effect in heterozygotes (h=0.01). Shown are the distributions generated from simulations. Means are represented by red horizontal bars. Violin plots show the density distribution of alleles segregating in these samples, whereas boxes indicate the proportion of sites for which the deleterious mutation was not observed. When a small fitness effect in heterozygotes is considered in the simulations, the mean decreases by 68% and a larger proportion of sites are not segregating. The two distributions differ significantly (p-value < 10^-15 by a Kolmogorov-Smirnov test).

[Figure S4: violin plots of Log10(Frequency) for columns A, B, C, D and ExAC.]

Fig S4. Effect of varying the end population size in the SIM model. Tennessen et al. [21] inferred the present effective population size of Europeans to be 512,000 individuals (column A). We considered the effect of larger population sizes (2-, 4-, and 10-fold increases, denoted by columns B, C and D respectively), keeping other parameters the same. The observed allele frequency distribution of 385 disease mutations in ExAC is shown in white. Violin plots show the density distribution of alleles segregating in these samples, whereas boxes indicate the proportion of sites for which the deleterious mutation was not observed. All distributions differ significantly from one another (i.e., all p-values are < 10^-15 by a Kolmogorov-Smirnov test).
[Figure S5: four panels of violin plots of Log10(Frequency), SIM versus ExAC. Panel titles: nonCpGtv (n=59), p-value<0.01; nonCpGti (n=83), p-value=0.01; CpGtv (n=9), p-value=0.02; CpGti (n=80), p-value=0.30.]

Fig S5. Expected and observed distributions of disease allele frequencies (excluding mutations in CFTR and DHCR7). The four panels correspond to four different mutation types. The title of the panel indicates the mutation type, followed by the total number of mutations, with the p-value for the difference between observed and expected mean frequencies below it. Results in grey (SIM) were obtained from simulations considering a plausible demographic model for European populations [21] (see text). The observed values estimated from 33,370 individuals of European ancestry from ExAC are shown in white. As opposed to Fig 1, we did not include mutations present in two genes (CFTR and DHCR7) that were outliers in the gene-level analysis (Fig 3) and for which healthy individuals homozygous for a deleterious allele have been reported elsewhere [20]. As in Fig 1, violin plots show the density distribution of variable sites, while boxes indicate the proportion of sites for which the wild-type, non-deleterious allele is fixed. Means (considering both segregating and fixed mutations) are indicated with red horizontal bars.

[Figure S6: box plots of coverage for groups A and B.]

Fig S6. Depth of coverage for 385 mutations in ExAC known to cause lethal, Mendelian diseases. Boxplots show the mean (black bar) and the lower and upper quartiles for (A) the 248 sites with non-zero sample frequencies in ExAC, for which the number of sequenced non-Finnish European individuals was reported (N = 32,881), and (B) the 137 sites for which we did not have this information. Since distributions of depth of coverage are similar between datasets, we assumed that 32,881 individuals were sequenced on average at all sites, and used this number to subsample simulations to match the sample size of the ExAC data.
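To make the subsampling step concrete, here is a minimal sketch of drawing a sample of 32,881 diploid individuals from a simulated population frequency. It assumes binomial sampling of 2 × 32,881 chromosomes, which is an approximation chosen for illustration; the procedure actually used is the one described in Methods, and the function names are hypothetical.

```python
import math
import random

N_INDIVIDUALS = 32_881              # sample size stated in the caption
N_CHROMOSOMES = 2 * N_INDIVIDUALS   # assumption: two chromosomes per individual

def prob_not_observed(q, n_chromosomes=N_CHROMOSOMES):
    """Probability that an allele at frequency q (0 <= q < 1) is absent from the sample."""
    return math.exp(n_chromosomes * math.log1p(-q))

def sample_frequency(q, n_chromosomes=N_CHROMOSOMES, rng=random):
    """Draw a sample allele frequency by binomial sampling of chromosomes."""
    count = sum(1 for _ in range(n_chromosomes) if rng.random() < q)
    return count / n_chromosomes

rng = random.Random(1)  # fixed seed so the draw is repeatable
print(prob_not_observed(1e-5))
print(sample_frequency(1e-5, rng=rng))
```

Under these assumptions, an allele at population frequency 1e-5 is missing from roughly half of such samples, which is why the boxes in the figures report the proportion of sites at which the deleterious mutation was not observed.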