Download The population genetics of human disease: the case of recessive

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Designer baby wikipedia , lookup

Polymorphism (biology) wikipedia , lookup

Mutagen wikipedia , lookup

Fetal origins hypothesis wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Quantitative trait locus wikipedia , lookup

Inbreeding wikipedia , lookup

Oncogenomics wikipedia , lookup

Genome (book) wikipedia , lookup

Tay–Sachs disease wikipedia , lookup

Epigenetics of neurodegenerative diseases wikipedia , lookup

Neuronal ceroid lipofuscinosis wikipedia , lookup

Koinophilia wikipedia , lookup

Human genetic variation wikipedia , lookup

Public health genomics wikipedia , lookup

Dominance (genetics) wikipedia , lookup

Epistasis wikipedia , lookup

Hardy–Weinberg principle wikipedia , lookup

Mutation wikipedia , lookup

Frameshift mutation wikipedia , lookup

Genetic drift wikipedia , lookup

Point mutation wikipedia , lookup

Population genetics wikipedia , lookup

Microevolution wikipedia , lookup

Transcript
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Thepopulationgeneticsofhumandisease:the
caseofrecessive,lethalmutations
Carlos Eduardo G. Amorim1,2*, Ziyue Gao3, Zachary Baker4, José
FranciscoDiesel5,YuvalB.Simons1,ImranS.Haque6,JosephPickrell1,7+,
MollyPrzeworski1,4+
1DepartmentofBiologicalSciences,ColumbiaUniversity,NewYork,NY10027
2CAPESFoundation,MinistryofEducationofBrazil,Brasília,DF,Brazil70040.
3HowardHughesMedicalInstitution,StanfordUniversity,Stanford,CA94305
4DepartmentofSystemsBiology,ColumbiaUniversity,NewYork,NY10027
5UniversidadeFederaldeSantaMaria,SantaMaria,RS,Brazil97105
6
Counsyl,180KimballWay,SouthSanFrancisco,CA
7
NewYorkGenomeCenter,NewYork,NY10013
+Theseauthorsco-supervisedthiswork
*Towhomcorrespondenceshouldbeaddressed([email protected])
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Abstract
Do the frequencies of disease mutations in human populations reflect a simple
balance between mutation and purifying selection? What other factors shape the
prevalenceofdiseasemutations?Tobegintoanswerthesequestions,wefocusedon
one of the simplest cases: recessive mutations that alone cause lethal diseases or
complete sterility. To this end, we generated a hand-curated set of 417 Mendelian
mutationsin32genes,reportedtocausearecessive,lethalMendeliandisease.We
thenconsideredanalyticmodelsofmutation-selectionbalanceininfiniteandfinite
populations of constant sizes and simulations of purifying selection in a more
realistic demographic setting, and tested how well these models fit allele
frequencies estimated from 33,370 individuals of European ancestry. In doing so,
we distinguished between CpG transitions, which occur at a substantially elevated
rate, and other mutation types. Whereas observed frequencies for CpG transitions
areclosetoexpectation,thefrequenciesobservedforothermutationtypesarean
order of magnitude higher than expected; this discrepancy is even larger when
subtlefitnesseffectsinheterozygotesorlethalcompoundheterozygotesaretaken
into account. In principle, higher than expected frequencies of disease mutations
could be due to widespread errors in reporting causal variants, compensation by
othermutations,orbalancingselection.Itisunclearwhythesefactorswouldaffect
CpG transitions differently from other mutations, however. We argue instead that
the unexpectedly high frequency of non-CpGti disease mutations likely reflects an
ascertainment bias: of all the mutations that cause recessive lethal diseases, those
that by chance have reached higher frequencies are more likely to have been
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
identified in medical studies and thus to have been included in this study. Beyond
thespecificapplication,thisstudyhighlightstheparameterslikelytobeimportant
inshapingthefrequenciesofMendeliandiseasealleles.
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
AuthorSummary
What determines the frequencies of disease mutations in human populations? To
begintoanswerthisquestion,wefocusononeofthesimplestcases:mutationsthat
causecompletelyrecessive,Mendeliandisease.Wefirstreviewtheoryaboutwhatto
expect from mutation and selection in a population of finite size and further
generatepredictionsbasedonsimulationsusingarealistdemographicscenarioof
human evolution. For a highly mutable type of mutations, involving transitions at
CpG sites, we find that the predictions fit observed frequencies of recessive lethal
diseasemutationswell.Forlessmutabletypes,however,predictionstendtounderestimate the observed frequency. We discuss possible explanations for the
discrepancy and point to a complication that, to our knowledge, is not widely
appreciated: that there exists ascertainment bias in disease mutation discovery.
Specifically, we suggest that alleles that have been identified to date are likely the
onesthatbychancehavedriftedtohigherfrequenciesandarethusmorelikelyto
have been mapped. More generally, our study highlights the parameters that
influence the frequencies of Mendelian disease alleles, helping to interpret the
relevanceofvariantsofunknownsignificancebasedontheirallelefrequencies.
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Introduction
New disease mutations arise in heterozygotes and either drift to higher
frequenciesorarerapidlypurgedfromthepopulation,dependingonthestrengthof
selection and the demographic history of the population [1-6]. Elucidating the
relative contributions of mutation, natural selection and genetic drift will help to
understand why disease alleles persist in humans. Answers to these questions are
alsoofpracticalimportance,ininforminghowgeneticvariationdatacanbeusedto
identifyadditionaldiseasemutations[7].
In this regard, rare, Mendelian diseases, which are caused by single highly
penetrant and deleterious alleles, are perhaps most amenable to investigation. A
simple model for the persistence of mutations that lead to Mendelian diseases is
that their frequency is determined by a balance between their introduction by
mutationandeliminationbypurifyingselection,i.e.,thatweexpecttofindthemat
“mutation-selectionbalance”(MSB)[4].Infinitepopulations,randomdriftleadsto
stochastic changes in the frequency of any mutation, so demographic history, in
addition to mutation and natural selection, plays an important role in shaping the
frequencydistributionofdeleteriousmutations[3].
Another factor that may be important in shaping the frequency of highly
penetrantdiseasemutationsisgeneticinteractions.Forinstance,adiseasemutation
mayberescuedbyanothermutationinthesamegene[8-10]orbyamodifierlocus
elsewhere in the genome that modulates the severity of the disease symptoms or
thepenetranceofthediseaseallele(e.g.[11-13]).
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Forasubsetofdiseaseallelesthatarerecessive,analternativemodelfortheir
frequencyinthepopulationisthatthereisanadvantagetocarryingonecopybuta
disadvantage to carrying two or none, such that the alleles persist due to
overdominance,aformofbalancingselection.Wellknownexamplesincludesickle
cell anemia, thalassemia and G6PD deficiency in populations living where the
malaria is endemic [14]. Beyond these cases, the importance of overdominance in
maintainingthehighfrequencyofdiseasemutationsisunknown.
Here,wetestedhypothesesaboutthepersistenceofmutationsthatcauselethal,
recessive, Mendelian disorders. This case provides a good starting point towards
elucidatingthedeterminantsofdiseasemutationsinhumanpopulations,becausea
large number of recessive lethal disorders have been mapped (e.g., genes have
already been associated with >66% of Mendelian disease phenotypes; [15]) and,
whilethefitnesseffectsofmostdiseasesarehardtoestimate,forlethaldiseases,the
selectioncoefficientisclearly1forhomozygotecarriers(atleastintheabsenceof
modern medical care). Moreover, the simplest expectation for mutation-selection
balancewouldsuggestthat,givenaperbasepairmutationrateontheorderof10-8
pergeneration[16],thefrequencyofsuchalleleswouldbe u ,i.e.,~10-4[4].Thus,
samplesizesinhumangeneticsarenowsufficientlylargethatweshouldbeableto
observerecessivediseaseallelessegregatinginheterozygotes.
Tothisend,wecompiledgeneticinformationforasetof417mutationsreported
to cause fatal, recessive Mendelian diseases and estimated the frequencies of the
disease-causingallelesfromlargeexomedatasets.Wethencomparedthesedatato
theexpectedfrequenciesofdeleteriousallelesbasedonmodelsofMSBinorderto
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
evaluate the effects of demography and other mechanisms in influencing these
frequencies.
Results
Mendelianrecessivediseasealleleset
We relied on two datasets, one that describes 173 autosomal recessive
diseases [17] and another from a genetic testing laboratory (Counsyl;
<https://www.counsyl.com/>) that includes 110 recessive diseases of clinical
interest. From these lists, we obtained a set of 44 “recessive lethal” diseases
associated with 42 genes (Table S1) requiring that at least one of the following
conditionsismet:(i)intheabsenceoftreatment,theaffectedindividualsdieofthe
diseasebeforereproductiveage,(ii)reproductioniscompletelyimpairedinpatients
ofbothsexes,(iii)thephenotypeincludesseverementalretardationthatinpractice
precludes reproduction, or (iv) the phenotype includes severely compromised
physical development, again precluding reproduction. Based on clinical genetics
datasets and the medical literature (see Methods for details), we were able to
confirmthat417SingleNucleotideVariants(SNVs)in32(ofthe42)geneshadbeen
reported as associated with the severe form of the corresponding disease with an
early-onset, and no indication of incomplete penetrance or effects in heterozygote
carriers(TableS2).Bythisapproach,weobtainedasetofmutationsforwhich,at
leastinprinciple,thereisnoheterozygoteeffect(i.e.,thedominancecoefficienth=0
inamodelwithgenotypefitnesses1,1-hsand1-s)andtheselectivecoefficientsfor
therecessivealleleis1.
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Alargesubsetofthesemutations(29.3%)consistsoftransitionsatCpGsites
(henceforth CpGti), which occur at a highly elevated rates (~17-fold higher)
compared to other mutation types, namely CpG transversions, and non-CpG
transitionsandtransversions[16].ThisproportionisinagreementwithCooperand
Youssoufian[18],whofoundthat~31.5%ofdiseasemutationsinexonsareCpGti.
EmpiricaldistributionofdiseaseallelesinEurope
Allele frequency data for the 417 variants were obtained from the Exome
AggregationConsortium(ExAC)for60,706individuals[19].Outofthe417variants
associatedwithputativerecessivelethaldiseases,threewerefoundhomozygousin
at least one individual in this dataset (rs35269064, p.Arg108Leu in ASS1;
rs28933375, p.Asn252Ser in PRF1; and rs113857788, p.Gln1352His in CFTR).
Available data quality information for these variants does not suggest that these
genotypes are due to calling artifacts (Table S2). Since these diseases have severe
symptomsthatleadtoearlydeathwithouttreatmentandtheseExACindividualsare
purportedly healthy (i.e., do not manifest severe diseases) [19], the reported
mutations are likely errors in pathogenicity classification or cases of incomplete
penetrance (see a similar observation for CFTR and DHCR7in [20]). We therefore
excluded them from our analyses. In addition to the mutations present in
homozygotes, we also filtered out sites that had lower coverage in ExAC (see
Methods),resultinginafinaldatasetof385variantsin32genes(TableS2).
Genotypesforasubset(N=91)ofthesemutationswerealsoavailablefora
larger sample size (76,314 individuals with self-reported European ancestry)
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
generated by the company Counsyl (Table S3). A comparison of the allele
frequencies in this larger dataset to that of ExAC data [19] showed excellent
concordance (Wilcoxon signed-rank test for paired samples, p-value=0.34; Fig S1).
Thus, both data sets appear to accurately reflect European frequencies of these
disease alleles, despite slight differences in the populations included. In what
follows,wefocusonExAC,whichincludesagreaternumberofdiseasemutations.
Modelsofmutation-selectionbalance
To generate expectations for the frequencies of these disease mutations
under mutation-selection balance, we considered models of infinite and finite
populations of constant size [3], and conducted forward simulations using a
plausible demographic model for African and European populations [21] (see
Methodsfordetails).Inthemodels,thereisawild-typeallele(A)andadeleterious
allele (a, which could also represent a class of distinct deleterious alleles with the
same fitness effect) at each site, such that the relative fitness of individuals of
genotypesAA,Aa,oraaisgivenby:
•
wAA=1;
•
wAa=1-hs;
•
waa=1-s;
ThemutationratefromAtoaisuandtherearenobackmutations.
For a constant population of infinite size, Wright [22] showed that under
theseconditions,thereexistsastableequilibriumbetweenmutationandselection,
when the selection pressure is sufficiently strong (s>>u). In particular, when the
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
deleteriouseffectofalleleaiscompletelyrecessive(h=0),itsequilibriumfrequency
qisgivenby:
!=
!/!.
(1)
For a finite population of constant size, Nei [3] derived the mean (eq. 2) and
variance (eq. 3) of the frequency of a fully recessive deleterious mutation (h=0)
basedonadiffusionmodel,leadingto:
q=
Γ(2 Nu + 1 / 2)
2 Ns Γ(2 Nu )
σ q2 = u / s − q 2 , ,
(2)
(3)
whereNisthediploidpopulationsizeand𝛤 is the gamma function(seeSimonsetal.
[1]forasimilarapproximation).
In a finite population, the mean frequency, q , therefore depends on
assumptionsaboutthepopulationmutationrate(2Nu).Ifthepopulationmutation
rateishigh,suchthat2Nu>>1, q isapproximatedby
q ≈ u / s ,
(4)
whichisindependentofthepopulationsizeandequaltofrequencyexpectedinan
infinite size population; in particular, for a recessive lethal mutation, the mean
frequency is equal to the right hand side of eq. (1). The important difference
betweenmodelsisthatinafinitepopulation,thereisadistributionoffrequenciesq
(becauseofgeneticdrift),ratherthanasinglevalue,whosevarianceisgivenineq.
(3).
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
In contrast, when the finite population has a low population mutation rate
(2Nu<<1),themeanallelefrequency, q ,isapproximatedby:
q ≈ u 2πN / s ,
which depends on the population size. In humans, the mutation rate at each base
pairisverysmall(ontheorderof10-8),sothesecondapproximationshouldapply
when considering each single site independently. The expectation and variance of
thefrequencyofafatal,fullyrecessiveallele(i.e.,s=1,h=0)arethengivenby:
q = u 2π N , (6)
σ q2 = u − q 2 = u(1− 2π Nu) ≈ u . (7)
and
All models assume complete penetrance, complete recessivity (h = 0),
lethality of homozygous mutations (s = 1) and consider a single site, thereby
ignoringthepossibilityofcompoundheterozygosity(unlessotherwisenoted).
Comparingmutation-selectionbalancemodels
Althoughaninfinitepopulationsizehasoftenbeenassumedwhenmodeling
deleteriousallelefrequencies(e.g.[5,23-26]),predictionsunderthisassumptioncan
differ markedly from what is expected from models of finite population sizes,
assuming plausible parameter values for humans. For example, the long-term
estimateoftheeffectivepopulationsizefromtotalpolymorphismlevelsis~20,000
individuals (assuming a per base pair mutation rate of 1.2x10-8 [16] and diversity
levels of 0.1% [27]). In this case, the average deleterious allele frequency in the
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
modeloffinitepopulationsizeis>25-foldlowerthanthatintheinfinitepopulation
model(Fig1).
Becausethehumanpopulationsizehasnotbeenconstant,andchangesinthe
populationsizecanaffectthefrequenciesofdeleteriousallelesinthepopulation(as
recently reviewed by Simons and Sella [2]), we also simulated mutation-selection
balance under a plausible demographic model for the population history of
European populations [21]. With our parameters, the mean allele frequency
obtainedfromthismodelwas8.1x10-6,~2-foldhigherthanexpectedforaconstant
population size model with Ne=20,000 (Fig 1). The mean frequency seen in
simulationsinsteadmatchestheexpectationforaconstantpopulationsizeof~72k
individuals (see Methods and Fig S2a). This finding is expected: the estimate of
20,000 is based on total genetic variation; assuming that most of this variation is
neutral, it therefore reflects an average timescale over millions of years. For
recessivelethalmutations,however,whicharerelativelyrapidlypurgedbynatural
selection,amorerecenttimedepthisrelevant(e.g.,[1]).Indeed,inoursimulations,
mostofthediseasealleles(65.6%)segregatingatpresentaroseveryrecently,such
thattheywerenotsegregatinginthepopulation205generationsago,thetimepoint
after which Ne is estimated to have increased from 9,300 to 512,000 individuals
[21].Increasingtheeffectivepopulationsizeisnotenoughtomodeldiseasealleles
appropriately however. For example, if we compare simulation results obtained
underthemorecomplexTennessenetal.[21]demographicmodel[21]tothosefor
simulationsofaconstantpopulationsizeofNe=72,348,themeanallelefrequencies
match,butthedistributionsofallelefrequenciesaresignificantlydifferent(FigS2b).
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
These findings thus confirm the importance of incorporating demographic history
intomodelsforunderstandingthepopulationdynamicsofdiseasealleles[5,28,29].
In what follows, we therefore test the fit of the results based on a more realistic
demographicmodel[21]totheobservedallelefrequencies.
Comparingempiricalandexpecteddistributionsofdiseasealleles
The mutation rate, u, from wild-type allele to disease allele is a critical
parameter in predicting the frequencies of a deleterious allele [4,30]. To model
disease alleles, we considered four mutation types separately, with the goal of
capturingmostofthefine-scaleheterogeneityinmutationrates[31-34]:transitions
inmethylatedCpGsites(CpGti)andthreelessmutabletypes,namelytransversions
inCpGsites(CpGtv)andtransitionsandtransversionsoutsideaCpGsite(nonCpGti
and nonCpGtv, respectively). In order to control for the methylation status of CpG
sites, which has an important influence on transition rates [31], we excluded 12
CpGti that occurred in CpG islands, which tend not to be methylated (following
Moorjanietal.[35]).Toallowforheterogeneityinmutationrateswithineachoneof
these four classes considered, we modeled the within-class variation in mutation
ratesaccordingtoalognormaldistribution(seedetailsinMethodsand[33]).
For each mutation type, we then compared results from the simulations to
what is observed in ExAC, focusing on the largest sample of the same common
ancestry, namely Non-Finnish Europeans (N = 33,370) (Fig 2). We find significant
differencesbetweenempiricalandexpectedmeanfrequenciesfornonCpGti(51-fold
on average; two-tailed p-value < 10-3; see Methods for details) and nonCpGtv (25-
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
fold on average, two-tailed p-value < 10-3), to a lesser extent for CpGtv (p-value =
0.05), but not for CpGti (p-value = 0.16). Intriguingly, the discrepancy between
observedandexpectedfrequenciesbecomessmallerasthemutationrateincreases
(Fig2).
Two additional factors should further decrease the expected frequencies
relative to our predictions, and will thus exacerbate the discrepancy observed
between empirical and expected distribution of deleterious alleles. First, we have
ignoredtheexistenceofcompoundheterozygosity,thecaseinwhichcombinations
oftwo distinctpathogenic allelesinthesamegeneleadtolethality.Weknow that
thisphenomenoniscommon[36],andindeed58.4%ofthe417diseasemutations
considered in this study were initially identified in compound heterozygotes.
Because of compound heterozygosity, each deleterious mutation will be selected
againstnotonlywhenpresentintwocopieswithinthesameindividual,butalsoin
the presence of certain lethal mutations at other sites of the same gene. Thus, the
frequency of a deleterious mutation will be lower under a model incorporating
compound heterozygosity than under a model that does not include this
phenomenon.
Inordertomodeltheeffectofcompoundheterozygosityinoursimulations,
wereranoursimulationsconsideringageneratherthanasite.Inthesesimulations,
we used the same setup as in the site level analysis, except for the mutation rate,
here defined as U, the sum of the mutation rates uiat each site i that is known to
cause a severe and early onset form of the disease (Table S2; see Methods for
details).Thisapproachdoesnotconsiderthecontributionofothermutationsinthe
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
genes that cause the mild and/or late onset forms of the disease, and implicitly
assumes that all combinations of known recessive lethal alleles of the same gene
have the same fitness effect as homozygotes. Comparing observed frequencies to
thosegeneratedbysimulation,onethirdofthegenesdifferatthe5%level,witha
clear trend for empirical frequencies to be above expectation (Table S4; Fig 3;
Fisher’scombinedprobabilitytestp-value<10-14).
This finding is even more surprising than it may seem, because we do not
know the complete mutation target for each gene and are therefore ignoring the
contributionofadditionalsitesatwhichdiseasemutationscouldarise.Ifthereare
undiscoveredrecessivelethalmutationsthatcausedeathwhenformingcompound
heterozygoteswiththeascertainedones,thepurgingeffectofpurifyingselectionon
the known mutations will be under-estimated, leading us to over-estimate the
expectedfrequenciesinsimulations.Therefore,ourpredictionsare,ifanything,an
over-estimate of the expected allele frequency and the discrepancy between
predictedandtheobservedislikelyevenlargerthanwhatwefound.
The other factor that we did not consider in simulations but would reduce
the expected allele frequencies is a subtle fitness decrease in heterozygotes. To
evaluatepotentialfitnesseffectsinheterozygoteswhennonehadbeendocumented
in humans, we considered the phenotypic consequences of orthologous gene
knockoutsinmouse.Wewereabletoretrieveinformationonphenotypesforboth
homozygoteandheterozygotemiceforonlyeightoutofthe32genes,namelyASS1,
CFTR, DHCR7, NPC1, POLG, PRF1, SLC22A5, and SMPD1. For all eight, homozygote
knockoutmicepresentedsimilarphenotypesasaffectedhumans,andheterozygotes
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
showedamilderbutdetectablephenotype(TableS5).Theextenttowhichthesame
phenomenon applies to the mutations with no clinically reported effects in
heterozygous humans is unclear, but the finding with knockout mice makes it
plausiblethatthereexistsaverysmallfitnessdecreaseinheterozygotesinhumans
as well, potentially enough to have a marked impact on the allele frequencies of
deleterious mutations, but not enough to have been recognized in clinical
investigations.Indeed,evenifthefitnesseffectinheterozygoteswereh=1%,a68%
decreaseinthemeanallelefrequencyofthediseasealleleisexpected(FigS3).
Discussion
To investigate the population genetics of human disease, we focused on
mutations that cause Mendelian, recessive disorders that lead to early death or
completely impaired reproduction. We sought to understand to what extent the
frequencies of these mutations fit the expectation based on a simple balance
between the input of mutations and the purging by purifying selection, as well as
how other mechanisms might affect these frequencies. Many studies implicitly or
explicitlycompareknowndiseaseallelefrequenciestoexpectationsfrommutationselection balance. We tested whether known disease alleles as a class fit these
expectations, and found that, under a sensible demographic model for European
populationhistorywithpurifyingselectiononlyinhomozygotes,theexpectationsfit
the observed disease allele frequencies well when the mutation rate is high (see
CpGtiinFig2).Ifmutationrateislow,however,asisthecaseforCpGtv,nonCpGti
and nonCpGtv, the mean empirical frequencies of disease alleles are above
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
expectation(Fig2).Further,includingpossibleeffectsofthediseaseinheterozygote
carriersorfortheeffectofcompoundheterozygotesinthesemodelsonlyincreases
thediscrepancy.
Inprinciple,higherthanexpecteddiseaseallelefrequenciescouldbeexplained
by at least four (non-mutually exclusive) factors: (i) misspecification of the
demographic model, (ii) widespread errors in reporting the causal variant, (iii)
overdominance of disease alleles and (iv) low penetrance of disease mutations.
Notably, it has been estimated that population growth in Europe could have been
strongerthatweconsideredinoursimulations[37,38].Strongerpopulationgrowth
doesincreasetheexpectedfrequencyofrecessivediseasealleles(FigS4,columnsAD).However,theimpactofmoreintensegrowthislikelyinsufficienttoexplainthe
observeddiscrepancy: the allele frequencies observedinExAC are stillon average
an order of magnitude larger than expected based on a model with ten-fold more
intensegrowththanin[21](FigS4).Similarly,whileerrorsinreportingthecausal
variants are known to be common [19,39,40], we attempted to minimize them by
filtering out any case for which there was not compelling evidence of association
with a recessive lethal disease, reducing our initial set of 769 mutations to 385 in
whichwehadgreaterconfidence(seeMethodsfordetails).
The other two factors, overdominance and low penetrance, are likely
explanationsforasubsetofcases.Forinstance,CFTR,thegeneinwhichmutations
lead to cystic fibrosis, is the furthest above expectation (p-value = 0.002; Fig 3). It
waslongnotedthatthereisanunusuallyhighfrequencyoftheCFTRdeletionΔF508
in Europeans, which led to speculation that disease alleles of this gene may be
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
subjecttoover-dominance([41],butsee[42]).Whetherornotthisisthecase,there
is evidence that disease mutations in this gene can complement one another [8,9]
and that modifier loci in other genes also influence their penetrance [9,12].
Consistent with variable penetrance, Chen et al. [20] identified three purportedly
healthyindividualscarryingtwocopiesofdiseasemutationsinthisgene.Similarly,
DHCR7, the gene associated with the Smith-Lemli-Opitz syndrome, is somewhat
above expectation in our analysis (p-value = 0.052; Fig 3) and healthy individuals
were found to be homozygous carriers of putatively lethal disease alleles in other
studies [20]. These observations make it plausible that, in a subset of cases
(particularlyforCFTR),thehighfrequencyofdeleteriousmutationsassociatedwith
recessive,lethaldiseasesareduetogeneticinteractionsthatmodifythepenetrance
ofcertainrecessivediseasemutations.
Nonetheless, two factors argue against modifiers alleles being a sufficient
explanation for the site-level analysis. First, when we remove 130 mutations in
CFTRand12inDHCR7,twogenesthatwereoutliersatthegene-level(Fig3;Table
S4) and for which healthy individuals were reported to be homozygous for a
deleterious allele [20], the discrepancy between observed and expected allele
frequenciesisbarelyimpacted(FigS5).Second,thereisnoobviousreasonwhythis
explanationwouldapplydifferentlytodistincttypesofmutations,yettheextentto
which observed allele frequencies exceed the expected depends on the mutation
rate,withnodepartureseenatCpGti(Fig2).
Instead, it seems plausible that there is likely an ascertainment bias in disease
allelediscoveryandmutationidentification.Sincenotallmutationsthatcancausea
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
specificMendeliandiseaseareknown,thosemutationsthatwerediscoveredtodate
are likely the ones that by chance have drifted to higher frequencies, and are thus
more likely to have been mapped and be reported in the medical literature. One
clearimplicationofthatisthattherearenumeroussitesatwhichmutationscause
recessive lethal diseases yet to be discovered, particular at non-CpG sites. More
generally,thisascertainmentbiascomplicatestheinterpretationofobservedallele
frequenciesintermsoftheselectionpressuresactingondiseasealleles.
Beyondthisspecificpoint,ourstudyillustrateshowthelargesamplesizesnow
madeavailabletoresearchersinthecontextofprojectslikeExAC[19]canbeused
not only for direct discovery of disease variants, but also to test why there are
diseaseallelessegregatinginthepopulationandtounderstandatwhatfrequencies
wemightexpecttofindthem.
Methods
Diseasealleleset
Inordertoidentifysinglenucleotidevariantswithinthe42genesassociated
with lethal, recessive Mendelian diseases (Table S1), we initially relied on the
ClinVardataset[43](accessedonJune3rd,2015).Wefilteredoutanyvariantthatis
anindeloramorecomplexcopynumbervariantorthatiseverclassifiedasbenign
orlikelybenigninClinVar(whetherornotitisalsoclassifiedaspathogenicorlikely
pathogenic). By this approach, we obtained 769 SNVs described as pathogenic or
likely pathogenic. For each one of these variants, we searched the literature for
evidence that it is exclusively associated to the lethal and early onset form of the
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
disease and was never reported as causing the mild and/or late-onset form of the
disease. We considered effects in the absence of medical treatment, as we were
interested in the selection pressures acting on the alleles over evolutionary scales
ratherthaninthelastoneortwogenerations.Variantswithmentionofincomplete
penetrance (i.e. for which homozygotes were not always affected) or with known
effects in heterozygote carriers were removed from the analysis. This process
yielded 417 SNVs in 32 genes associated with distinct Mendelian recessive lethal
disorders (Table S2). Although these mutations were purportedly associated with
complete recessive diseases, we sought to examine whether there would be
possible, unreported effects in heterozygous carriers. To this end, we used the
Mouse Genome Database (MGD) [44] (accessed July 29th, 2015) and were able to
retrieveinformationforbothhomozygoteandheterozygotemiceforeightoutofthe
32genes(allofwhichwithahomologueinmice)(TableS5).
In addition to the information provided by ClinVar for each one of these
variants,weconsideredtheimmediatesequencecontextofeachSNV,totailorthe
mutationrateestimateaccordingly[16].Fordoingso,weusedanin-housePython
script and the human genome reference sequence hg19 from UCSC
(<http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/>).
Geneticdatasets
TheExomeAggregationConsortium(ExAC)[19]wasaccessedonAugust9th,
2016. The data consist of genotype frequencies for 60,706 individuals, assigned
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
afterPCAanalysistooneofsevenpopulationlabels:African(N=10,406),EastAsian
(N=8,654), Finnish (N=6,614), Latino (N=11,578), Non-Finnish European
(N=66,740),SouthAsian(N=16,512)and“other”(N=908).Wefocusedouranalyses
onthoseindividualsofNon-FinnishEuropeandescent,becausetheyconstitutethe
largest sample size from a single ancestry group. We note that, some diseases
mutations,forinstance,thoseinASPA,HEXAandSMPD1,areknowntobeespecially
prevalent in Ashkenazi Jewish populations, and that could potentially bias our
resultsifAshkenaziJewishindividualsconstitutedagreatportionofthesamplewe
considered. However, this sample includes only ~2,000 (~3%) Ashkenazi
individuals(Dr.DanielMacArthur,personalcommunication).
Fromtheinitial417mutations,wefilteredoutthreethatwerehomozygous
in at least one individual in ExAC and 29 that had lower coverage, i.e., fewer than
80%oftheindividualsweresequencedtoatleast15x.Thisapproachleftuswitha
set of 385 mutations with a minimum coverage of 27x per sample and an average
coverage of 69x per sample (Table S2). For 248 sites with non-zero sample
frequencies, ExAC reported the number of non-Finnish European individuals that
weresequenced,whichwasonaverage32,881individuals[19].Fortheremaining
137 sites, we did not have this information. Nonetheless, the mean coverage is
reportedforallsitesanddoesnotdifferbetweenthetwosetsofsites(FigS6).We
thereforeassumedthatmeannumberofindividualscoveredforallsiteswas32,881
[45] and used this number to obtain sample frequencies from simulations, as
explainedbelow.
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
A
second
genetic
dataset
was
obtained
from
Counsyl
(<https://www.counsyl.com/>). Counsyl is a commercial genetic screening
laboratory that offers, among other products, the “Family Prep Screen”, a genetic
screening test intended to detect carrier status for up to 110 recessive Mendelian
diseases in couples that are planning to have a child. A subset of 294,000 of its
customers was surveyed by genotyping or sequencing for “routine carrier
screening”.Thissubsetexcludesindividualswithindicationsfortestingbecauseof
known personal or family history of Mendelian diseases, infertility, and
consanguinity.Itthereforerepresentsamorerandom(withregardtothepresence
ofdiseasealleles),population-basedsurvey.Fortheseindividuals,wehaddetailson
self-reported ancestry (14 distinct ethnic/ancestry/geographic groups) and the
allele frequencies for 98 mutations that match those that passed our variant
selectioncriteriadescribedabove,ofwhich92arealsosequencedtohighcoverage
in the ExAC database (Table S2). We focused our analysis of this dataset on the
76,314individualswithself-reportedNorthernorSouthernEuropeanancestry.
Simulatingtheevolutionofdiseasealleleswithpopulationsizechange
Wealsomodeledthefrequencyofadeleteriousalleleinhumanpopulations
by forward simulations based on a crude but plausible demographic model for
humanpopulationsfromAfricaandEurope,inferredfromexomedataforAfricanAmericansandEuropean-Americans[21].Tothisend,weusedaprogramdescribed
in[1].Inbrief,thedemographicscenarioconsistsofanOut-of-Africademographic
model,withchangesinpopulationsizethroughoutthepopulationhistory,including
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
aseverebottleneckinEuropeansfollowingthesplitfromtheAfricanpopulationand
arapid,recentpopulationgrowthinbothpopulations[21].AsinSimonsetal.[1],
wesimulatedgeneticdriftandtwo-waygeneflowbetweenAfricansandEuropeans
inrecenthistory.Negativeselectionactingonasinglebi-allelicsitewasmodeledas
intheanalyticmodels.
Allele frequencies follow a Wright-Fisher sampling scheme in each
generation according to these viabilities, with migration rate and population sizes
varying according to the demographic scenario considered. Whenever a
demographicevent(e.g.growth)alteredthenumberofindividualsandtheresulting
numberwasnotaninteger,weroundedittothenearestinteger,asinSimonsetal.
[1].Aburn-inperiodof10NegenerationswithconstantpopulationsizeNe=7,310
individuals was implemented in order to ensure an equilibrium distribution of
segregatingallelesattheonsetofdemographicchangesinAfrica,148Kya.
In contrast to Simons et al. [1], our simulations always start with the
ancestral allele A fixed and mutation occurs exclusively from this allele to the
deleteriousone(a),i.e.amutationoccurswithmeanprobabilityupergamete,per
generation,andthereisnoback-mutation.However,recurrentmutationsatasite
areallowed,asinSimonsetal.[1].
When implementing the model, we used mean mutation rates u estimated
fromalargehumanpedigreestudy[16],consideringthegenomicaverage(1.2x10-8
perbasepair,pergeneration)andfourdistinctmutationtypes(CpGti=1.12x10-7;
CpGtv=9.59x10-9;nonCpGti=6.18x10-9;andnonCpGtv=3.76x10-9).Whilethese
four categories capture much of the variation in germline mutation rates across
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
sites,anumberofotherfactors(e.g.,thelargersequencecontextorthereplication
timing) also influence mutation rates, introducing heterogeneity in the mutation
ratewithineachclassconsidered[31-33,46].Toallowforthisheterogeneityaswell
asforuncertaintyinthepointmutationratesestimates,ineachsimulation,instead
ofusingafixedrateuforeachmutationtype,wedrewthemutationrateMfroma
lognormaldistributionwiththefollowingparameters:
σ2
log10 M | u ~ N(log10 u −
ln(10), σ 2 ) 2
(7)
suchthatthatE[M]=u.σwassetto0.57(following[33]).
By this procedure, we ran two million simulations for each mutation type,
thus obtaining the distribution of deleterious allele frequencies expected for the
Europeanpopulation.Inordertocomparesimulationresultstotheempiricaldata,
we subsampled the simulated population to match the average number of
autosomal chromosomes in the non-Finnish European sample from ExAC (N =
65,762chromosomes).
Tomeasurethesignificanceofthedeviationbetweenobservedandexpected
allelefrequencies,weproceededasfollows:First,wesampledKallelefrequencies
from the 2M simulations implemented for each mutation type, where K is the
number of mutations described for that type. We repeated this step 1,000 times,
thus obtaining a distribution for the mean allele frequency across K mutations.
Finally,toobtainatwo-tailedp-value,weconsideredtherankoftheempiricalmean
relativetosimulatedoutcomes.
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Awell-knownsourceofheterogeneityinmutationratewithintheCpGticlass
ismethylationstatus,withahightransitionrateseenonlyatmethylatedCpGs[18].
In our analyses, we tried to control for the methylation status of CpG sites by
excluding sites located in CpG islands (CGIs). The CGI annotation for hg19 was
obtained
from
UCSC
Genome
Browser
(track
“Unmasked
CpG”;
<http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/cpgIslandExtUnmas
ked.txt.gz>, accessed in June 6th, 2016) and BEDTools [47] was used to exclude
thoseCpGsiteslocatedinCGIs.WenotethattheCpGtiestimatefrom[16]includes
CGIs,andinthatsensetheaveragemutationratethatweareusingforCpGtimaybe
averyslightunderestimateofthemeanratefortransitionsatmethylatedCpGsites.
Unlessotherwisenoted,theexpectationassumesfullyrecessive,lethalalleles
with complete penetrance. Notably, by calculating the expected frequency one site
at a time, we are ignoring possible interaction between genes (i.e., effects of the
genetic background) and among different mutations within a gene (i.e., compound
heterozygotes).Theseassumptionsarerelaxedintwoways.Inoneanalysis(FigS3),
we consider a very low selective effect in heterozygous individuals (h = 1%),
reasoningthatsuchaneffectcouldplausiblygoundetectedinmedicalexaminations
andyetwouldnonethelessimpactthefrequencyofthediseaseallele.Second,when
considering the gene-level analysis (Fig 3), we implicitly allow for compound
heterozygositybetweenanypairoflethalmutations.Forthisanalysis,weran1000
simulationsforatotalmutationrateUpergenethatwascalculatedaccountingfor
theheterogeneityanduncertaintyinthemutationratesestimatesasfollows:(i)For
sitesknowntocausearecessivelethaldiseaseandthatpassedourfilteringcriteria
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
(TableS2),wedrewamutationrateuifromthelognormaldistribution,asdescribed
above;(ii)WethentookthesumofuiasthetotalmutationrateU;(iii)Wethenran
onereplicatewithUasthemutationparameter,andotherparametersasspecified
forsitelevelanalysis.Becausethemutationaltargetsizeconsideredinsimulations
is only comprised of those sites at which mutations are known to cause a lethal
recessivedisease,itisalmostcertainlyanunderestimateofthetruemutationrate—
potentially by a lot. We note further that by this approach, we are assuming that
compoundheterozygotesformedbyanytwolethalalleleshavefitnesszero,i.e.,that
they are identical in their effects to homozygotes for any of the lethal alleles.
Moreover, we are implicitly ignoring the possibility of complementation, which is
(somewhat) justified by our focus on mutations with severe effects and complete
penetrance (but see Discussion). Since we were interested in understanding the
effect of compound heterozygosity, for this analysis, we did not consider the five
genesinwhichonlyonemutationpassedourfilters(BCS1L,FKTN,LAMA3,PLA3G6,
andTCIRG1).
AllcodesanddatatogeneratethefiguresinR[48]andthescriptusedtoget
the sequence context of each mutation (kindly provided by Ellen Leffler) are
available at https://github.com/cegamorim/PopGenHumDisease. The code to run
the simulations is available at https://github.com/sellalab/PopGenHumDisease.
Allelefrequenciesandotherinformationforthediseasemutationsemployedinthe
analysesareinTablesS2andS3.
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Acknowledgements
We thank Dr. Daniel MacArthur for providing information on accessing ExAC
information. CEGA was partially funded by a Science Without Borders fellowship
fromCAPESfoundation(BEX8279/11-0)andCNPq(PDE201145/2015-4),Brazil.
ZGwaspartiallysupportedbyapostdoctoralfellowshipfundedbyStanfordCenter
forComputational,EvolutionaryandHumanGenomics.JFDwasfundedbyaScience
WithoutBordersfellowshipfromCAPESfoundation(88888.038761/2013-00).The
work was partially supported by a Research Initiative in Science and Engineering
grant from Columbia University to JKP and MP. The computing in this project was
supported by two National Institutes of Health instrumentation grants
(S10OD012351andS10OD021764)receivedbytheDepartmentofSystemsBiology
atColumbiaUniversity.Thefundershadnoroleinstudydesign,datacollectionand
analysis,decisiontopublish,orpreparationofthemanuscript.
AuthorContributions
Conceivedthestudy:JP,MP.Designedthestudy:CEGA,ZG,MP.Analyzedthedata:
CEGA.Implementedanalyticalmodels:ZG.Wrotethepaper:CEGA,ZG,MP.Helped
inacquisitionandanalysisofdata:ZB,JFD.Contributedanalyticaltoolsordata:YBS,
IR.
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
References
1. Simons YB, Turchin MC, Pritchard JK, Sella G (2014) The deleterious mutation
loadisinsensitivetorecentpopulationhistory.NatGenet46:220-224.
2. Simons YB, Sella G (2016) The impact of recent population history on the
deleterious mutation load in humans and close evolutionary relatives.
bioRxiv.
3. Nei M (1968) The frequency distribution of lethal chromosomes in finite
populations.ProcNatlAcadSciUSA60:517-524.
4. Gillespie JH (2004) Population Genetics: A Concise Guide. Baltimore, MD: Johns
HopkinsUniversityPress.
5. Brandvain Y, Wright SI (2016) The Limits of Natural Selection in a
NonequilibriumWorld.TrendsGenet32:201-210.
6.BalickDJ,DoR,CassaCA,ReichD,SunyaevSR(2015)DominanceofDeleterious
AllelesControlstheResponsetoaPopulationBottleneck.
7.BeauchampKA,MuzzeyD,WongKK,HoganGJ,KarimiK,etal.(2016)Systematic
DesignandComparisonofExpandedCarrierScreeningPanels.bioRxiv.
8. Cormet-Boyaka E, Jablonsky M, Naren AP, Jackson PL, Muccio DD, et al. (2004)
Rescuing cystic fibrosis transmembrane conductance regulator (CFTR)processingmutantsbytranscomplementation.ProcNatlAcadSciUSA101:
8221-8226.
9.RapinoD,SabirzhanovaI,Lopes-PachecoM,GroverR,GugginoWB,etal.(2015)
Rescue of NBD2 mutants N1303K and S1235R of CFTR by small-molecule
correctorsandtranscomplementation.PLoSOne10:e0119796.
10. Andressoo JO, Jans J, de Wit J, Coin F, Hoogstraten D, et al. (2006) Rescue of
progeriaintrichothiodystrophybyhomozygouslethalXpdalleles.PLoSBiol
4:e322.
11.GallatiS(2014)Disease-modifyinggenesandmonogenicdisorders:experience
incysticfibrosis.ApplClinGenet7:133-146.
12.CorvolH,BlackmanSM,BoellePY,GallinsPJ,PaceRG,etal.(2015)Genome-wide
associationmeta-analysisidentifiesfivemodifierlocioflungdiseaseseverity
incysticfibrosis.NatCommun6:8382.
13.HabaraA,SteinbergMH(2016)Minireview:Geneticbasisofheterogeneityand
severityinsicklecelldisease.ExpBiolMed(Maywood)241:689-696.
14. Hedrick PW (2011) Population genetics of malaria resistance in humans.
Heredity(Edinb)107:283-304.
15.ChongJX,BuckinghamKJ,JhangianiSN,BoehmC,SobreiraN,etal.(2015)The
Genetic Basis of Mendelian Phenotypes: Discoveries, Challenges, and
Opportunities.AmJHumGenet97:199-215.
16.KongA,FriggeML,MassonG,BesenbacherS,SulemP,etal.(2012)Rateofde
novo mutations and the importance of father's age to disease risk. Nature
488:471-475.
17. Turner TN, Douville C, Kim D, Stenson PD, Cooper DN, et al. (2015) Proteins
linked to autosomal dominant and autosomal recessive disorders harbor
characteristicraremissensemutationdistributionpatterns.HumMolGenet.
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
18. Cooper DN, Youssoufian H (1988) The CpG dinucleotide and human genetic
disease.HumGenet78:151-155.
19.LekM,KarczewskiKJ,MinikelEV,SamochaKE,BanksE,etal.(2016)Analysisof
protein-codinggeneticvariationin60,706humans.Nature536:285-291.
20.ChenR,ShiL,HakenbergJ,NaughtonB,SklarP,etal.(2016)Analysisof589,306
genomes identifies individuals resilient to severe Mendelian childhood
diseases.NatBiotechnol34:531-538.
21.TennessenJA,BighamAW,O'ConnorTD,FuW,KennyEE,etal.(2012)Evolution
and functional impact of rare coding variation from deep sequencing of
humanexomes.Science337:64-69.
22.WrightS(1937)TheDistributionofGeneFrequenciesinPopulations.ProcNatl
AcadSciUSA23:307-320.
23. Reich DE, Lander ES (2001) On the allelic spectrum of human disease. Trends
Genet17:502-510.
24. Pritchard JK, Cox NJ (2002) The allelic architecture of human disease genes:
commondisease-commonvariant...ornot?HumMolGenet11:2417-2423.
25. Zwick ME, Cutler DJ, Chakravarti A (2000) Patterns of genetic variation in
Mendelianandcomplextraits.AnnuRevGenomicsHumGenet1:387-407.
26.CassaCA,WeghornD,BalickDJ,JordanDM,NusinowD,etal.(2016)Estimating
the Selective Effect of Heterozygous Protein Truncating Variants from
HumanExomeData.bioRxiv.
27.TheGenomesProjectC(2015)Aglobalreferenceforhumangeneticvariation.
Nature526:68-74.
28. Lohmueller KE, Indap AR, Schmidt S, Boyko AR, Hernandez RD, et al. (2008)
Proportionally more deleterious genetic variation in European than in
Africanpopulations.Nature451:994-997.
29.PeischlS,DupanloupI,BosshardL,ExcoffierL(2016)Geneticsurfinginhuman
populations:fromgenestogenomes.CurrOpinGenetDev41:53-61.
30. Pritchard JK, Cox NJ (2002) The allelic architecture of human disease genes:
commondisease-commonvariant...ornot?
31. Segurel L, Wyman MJ, Przeworski M (2014) Determinants of mutation rate
variationinthehumangermline.AnnuRevGenomicsHumGenet15:47-70.
32. Aggarwala V, Voight BF (2016) An expanded sequence context model broadly
explains variability in polymorphism levels across the human genome. Nat
Genet48:349-355.
33.HarpakA,BhaskarA,PritchardJK(2016)Effectsofvariablemutationratesand
epistasisonthedistributionofallelefrequenciesinhumans.bioRxiv.
34.HodgkinsonA,Eyre-WalkerAVariationinthemutationrateacrossmammalian
genomes.
35. Moorjani P, Amorim CE, Arndt PF, Przeworski M (2016) Variation in the
molecularclockofprimates.ProcNatlAcadSciUSA113:10607-10612.
36. Kamphans T, Sabri P, Zhu N, Heinrich V, Mundlos S, et al. (2013) Filtering for
compound heterozygous sequence variants in non-consanguineous
pedigrees.PLoSOne8:e70151.
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
37. Gazave E, Ma L, Chang D, Coventry A, Gao F, et al. (2014) Neutral genomic
regions refine models of recent rapid human population growth. Proc Natl
AcadSciUSA111:757-762.
38. Gao F, Keinan A (2016) Explosive genetic evidence for explosive human
populationgrowth.CurrOpinGenetDev41:130-139.
39.XueY,ChenY,AyubQ,HuangN,BallEV,etal.(2012)Deleterious-anddiseaseallele prevalence in healthy individuals: insights from current predictions,
mutationdatabases,andpopulation-scaleresequencing.AmJHumGenet91:
1022-1032.
40.PitonA,RedinC,MandelJL(2013)XLID-causingmutationsandassociatedgenes
challenged in light of data from large-scale human exome sequencing. Am J
HumGenet93:368-383.
41.GabrielSE,BrigmanKN,KollerBH,BoucherRC,StuttsMJ(1994)Cysticfibrosis
heterozygote resistance to cholera toxin in the cystic fibrosis mouse model.
Science266:107-109.
42. Quinton PM (1994) Human genetics. What is good about cystic fibrosis? Curr
Biol4:742-743.
43. Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, et al. (2014) ClinVar:
public archive of relationships among sequence variation and human
phenotype.Nucleicacidsresearch42:D980-D985.
44.EppigJT,BlakeJA,BultCJ,KadinJA,RichardsonJE(2015)TheMouseGenome
Database (MGD): facilitating mouse as a model for human biology and
disease.NucleicAcidsRes43:D726-736.
45.HaqueIS,LazarinGA,KangH,EvansEA,GoldbergJD,etal.(2016)MOdeledfetal
riskofgeneticdiseasesidentifiedbyexpandedcarrierscreening.JAMA316:
734-742.
46. Mugal CF, Ellegren H (2011) Substitution rate variation at human CpG sites
correlates with non-CpG divergence, methylation level and GC content.
GenomeBiol12:R58.
47.QuinlanAR,HallIM(2010)BEDTools:aflexiblesuiteofutilitiesforcomparing
genomicfeatures.Bioinformatics26:841-842.
48.RCoreTeam(2015)R:ALanguageandEnvironmentforStatisticalComputing.
Vienna,Austria.
49. Online Mendelian Inheritance in Man O (2016). Baltimore, MD: McKusickNathansInstituteofGeneticMedicine,JohnsHopkinsUniversity.
1e+04
1e+02
1e+03
Density
1e+05
1e+06
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
0
−6
−5
−4
−3
Log10(Allele frequency)
Fig 1. Expected allele frequencies in the population, under a mutationselection balance models. Blue bar denotes the expected under an infinite
populationsize,agreenbarunderafiniteconstantpopulation,andaredbarunder
aplausibledemographicmodelforEuropeanpopulations(distributionshowninthe
greyhistogram).Allmodelsassumes=1andh=0.Forthefiniteconstantpopulation
sizemodel,wepresentthemeanfrequencyexpectedforapopulationsizeof20,000
(seeFigS2afortheeffectofvaryingthepopulationsize).Sampleallelefrequencies
(q) were transformed to log10(q) and those q=0 were set to 10-7 for visual
purposes,butindicatedas“0”ontheX-axis.Y-axisisplottedonalog-scale.
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
−5
−4
−3
−2
SIM
−6
81%
−
0
−
Log10(Frequency)
−3
−4
−5
−
−6
0
Log10(Frequency)
−2
−1
CpGtv (n=13)
p−value=0.05
−1
CpGti (n=101)
p−value=0.16
98%
33%
ExAC
SIM
ExAC
−1
ExAC
−3
−2
SIM
−4
−5
70%
−6
98%
−
−
0
−
Log10(Frequency)
−2
−3
−6
−5
−4
−
0
Log10(Frequency)
54%
nonCpGtv (n=104)
p−value<1e−3
−1
nonCpGti (n=155)
p−value<1e−3
−
99%
59%
SIM
ExAC
Fig 2. Expected and observed distributions of disease allele frequencies. The
four panels correspond to four different mutation types. The title of the panel
indicatesthemutationtype,followedbythetotalnumberofmutations,withp-value
for the difference between observed and expected mean frequencies below it.
Results in grey (SIM) were obtained from simulations considering a plausible
demographicmodelforEuropeanpopulations[21](seetext).Theobservedvalues
estimated from 33,370 individuals of European ancestry from ExAC are shown in
white. Violin plots show the density distribution of variable sites, while boxes
indicatetheproportionofsitesforwhichthewild,non-deleteriousmutationisfixed.
Means (considering both segregating and fixed mutations) are indicated with red
horizontalbars.
−−−
−2
−3
−
− −−
−4
−
−5
− −
− −
−
−
−
−
−
− −
−
− −
− −
−
−
100%
75%
50%
25%
0
FA
H
LA
M
B
ER 3
C
C
8
AS
PA
H
SD
17
B
PP 4
T1
D
H
C
R
7
PE
X7
PR
F1
AS
SM S1
AR
C
AL
N
PC 1
1
G
BE
1
G
AL
C
TK
2
ST
AR
C
LN
5
TP
P1
G
AA
SL
C
2
PO 2A
M 5
G
N
T1
G
AN
H
EX
A
PD
PO 1
LG
UA
SM
ID
C
FT
R
0
Log10(Frequency)
−
FigS3.Diseaseallelefrequenciesatthegenelevel.Theexpectation(grey)isbasedon1000simulations,assumingnoeffect
in heterozygotes, but allowing for compound heterozygosity (see Methods for details). Observed frequencies (purple bars)
wereobtainedfromExACconsidering33,370Europeanindividuals.Genesareorderedaccordingtothetwo-tailedp-valuefor
thesignificanceofthedeviationoftheempiricalfrequenciesfromtheexpected(TableS4).Tocalculatethat,weconsideredthe
rankoftheempiricalmeanrelativetothesimulations.Violinplotsshowthedistributionofsimulatedallelefrequenciesamong
segregating alleles and boxes represent the fraction of simulations in which no deleterious allele was observed in the
simulatedsampleatpresent.
TableS5.Phenotypiceffectofmouseknock-outs(seemaintext)
Gene
ASS1
Humandisease
Citrullinemia
OMIM
number
Phenotypeofaffectedhuman
casesa
215700
Veryhighconcentrationofthe
amino-acidcitrullineinserum,
spinalfluid,andurine.
219700
Disruptionofexocrinefunctionof
thepancreas,intestinalglands
(meconiumileus),biliarytree
(biliarycirrhosis),bronchialglands
(chronicbronchopulmonary
infectionwithemphysema)and
sweatglands(highsweatelectrolyte
withdepletioninahot
environment).Infertilityoccursin
malesandfemales.
CFTR
Cysticfibrosis
DHCR7
Smith-Lemli-Opitz
syndrome
270400
Multiplecongenitalmalformation
andmentalretardationsyndrome.
NPC1
Niemann-Pick
disease,typeC1
257220
Lipidstoragedisordercharacterized
byprogressiveneurodegeneration.
Phenotypeofhomozygous
knockoutmiceb
Completeneonatallethality,
abnormalcirculatingamino-acid
level,increasedcirculating
ammonialevel.
Partialpostnatallethality,
aphagia,pancreaticacinarcell
atrophy,abnormalintestine
morphology,abnormaldigestive
systemphysiology,abnormal
glandmorphology,acute
pancreasinflammation,weight
loss,distendedabdomen,
abnormalionhomeostasis,
enlargedgallbladder,abnormal
respiratorysystemphysiology,
lacrimalglandatrophy.
Completeneonatallethality,
abnormalsucklingbehavior,
weakness,abnormalnasal
cavitymorphology,fetalgrowth
retardation,cyanosis,abnormal
braindevelopment,distended
urinarybladder.
Prematuredeath,abnormal
Purkinjecellmorphology,
increasedbraincholesterol
level,increasedlivercholesterol
level,abnormalmacrophage
morphology,abnormal
microglialcellactivation,
abnormallipidhomeostasis,
decreasedbodyweight,
Phenotypeof
heterozygousknockout
miceb
Abnormalcirculating
amino-acidlevel.
Impairedfertilization,
decreasedlittersize.
Abnormalcholesterollevel,
syndactyly,partial
embryoniclethality,
decreasedbrainsize.
Increasedbraincholesterol
level.
POLG
Alperssyndrome
203700
PRF1
Hemophagocytic
603553
lymphohistiocytosis
SLC22A5
Carnitinedeficiency 212140
SMPD1
Niemann-Pick
disease,typeA
257200
impairedcoordination.
Clinicaltriadofpsychomotor
Prematuredeath,abnormal
retardation,intractableepilepsy,and mitochondrialphysiology,
liverfailureininfantsandyoung
decreasedthymocytenumber,
children.Pathologicfindingsinclude abnormallymphopoiesis,
neuronallossinthecerebralgray
macrocyticanemia,abnormal
matterwithreactiveastrocytosis
erythroidlineagecell
andlivercirrhosis.
morphology.
IncreasedactivatedTcell
number,decreasedcytotoxicT
cellcytolysis,abnormalcytokine
secretion,decreased
susceptibilitytoautoimmune
Immunedysregulationcharacterized
diabetes,increased
clinicallybyfever,edema,
susceptibilitytoviralinfection,
hepatosplenomegaly,andliver
prematuredeath,complete
dysfunction.Neurologicimpairment,
postnatallethality,liver
seizures,andataxiaarefrequent.
inflammation,CNS
inflammation,abnormal
circulatingcytokinelevel,
decreasedleukocytecell
number.
Thisresultsinimpairedfattyacid
Prematuredeath,enlargedliver,
oxidationinskeletalandheart
hepaticsteatosis,increased
muscle.Inaddition,renalwastingof
triglyceridelevel,decreased
carnitineresultsinlowserumlevels
circulatingcarnitinelevel,
anddiminishedhepaticuptakeof
impairedlipolysis,decreased
carnitinebypassivediffusion,which
bodyweight,enlargedheart.
impairsketogenesis.
Prematuredeath,ataxia,
TheclinicalphenotypefortypeA
lethargy,abnormalapoptosis,
rangesfromasevereinfantileform
decreasedbodyweight,
withneurologicdegeneration
increasedmacrophagederived
resultingindeathusuallyby3years foamcellnumber,abnormal
ofage.
lipidhomeostasis,increased
susceptibilitytobacterial
Abnormalbonemarrowcell
physiology,increasedBcell
derivedlymphoma
incidence.
Insulitis,periinsulitis,
impairednaturalkillercell
mediatedcytotoxicity.
Decreasedcirculating
carnitinelevel,impaired
lipolysis.
Abnormalimmunesystem
cellmorphology,abnormal
neurondifferentiation,
abnormaldepressionrelatedbehavior.
infection,decreasedbrainsize.
Phenotypesobtainedfrom[49]aand[44]b
−4
−
23%
20%
ExAC
Counsyl
−5
−
0
Log10(Frequency)
−3
−2
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
FigS1.FrequencydistributionofdiseasemutationsinindividualsofEuropean
ancestry. Shown are allele frequencies for 92 variants associated with lethal,
recessive diseases, as estimated from 33,370 individuals non-Finnish, Europeanancestry in the Exome Aggregation Consortium (ExAC) database [19] and 76,314
European-ancestry individuals from a genetic testing laboratory (Counsyl) (see
Methods).Allelefrequenciesdonotdiffersignificantlybetweendatasets(Wilcoxon
signed-rank test for paired samples, p-value=0.34). Violin plots show the density
distribution of alleles segregating in these samples, while boxes indicate the
proportionofsitesforwhichthedeleteriousmutationwasnotobserved.
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
2.0e−05
1.0e−05
0.0e+00
Deleterious allele frequency
a
0e+00
1e+05
2e+05
3e+05
4e+05
5e+05
Population size
−4
−6
0
Log10(Frequency)
−2
b
−
−
87%
98%
SIM
FIN
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Fig S2. Comparisons of SIM and FIN models. (A) Mean allele frequency as a
function of effective population size, under the FIN model. The X-axis range
corresponds to the minimum and maximum effective population size estimated in
[21].Theredbarindicatesthevalueofaconstantpopulationsizeatwhichthemean
allelefrequencypredictedunderaconstantpopulationsizemodelisthesameasthe
meanallelefrequencyestimatedinsimulations(SIM),foranaveragemutationrate
of 1.20x10-8 [16]. Even for this value, the mean matches the constant size
expectation,butthedistributionsdiffer(seepanel“b”below).(B)SIMreferstothe
complexdemographicscenarioinferredbyTennessenetal.[21]fortheevolutionof
European populations (see Methods). In the finite, constant size population model
(FIN),Nissetto72,348individuals,sothatthemeanallelefrequency(redbars)is
thesameasinsimulations(seepanel“a”above).Togenerateadistributionforthis
model, we simulated a constant population size, as described in Methods for the
Tennessen et al. model. Violin plots show the distribution of allele frequencies
amongsegregatingalleles,whiletheboxesindicatetheproportionofsitesatwhich
the non-disease allele is fixed. These models assume strong selection (s=1) and
completerecessivity(h=0).Ascanbeseen,thetwodistributionsdiffersignificantly
(Kolmogorov-Smirnovtest,p-value<10-15).
−4
−5
−
−
0
−6
Log10(Frequency)
−3
−2
−1
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
87%
94%
h=0
h = 1%
Fig S3. The impact on disease allele frequencies of a small fitness effect in
heterozygotes (h=0.01). Shown is the distribution generated from simulations.
Means are represented by red horizontal bars. Violin plots show the density
distribution of alleles segregating in these samples, whereas boxes indicate the
proportion of sites for which the deleterious mutation was not observed. When a
small fitness effect in heterozygotes is considered in the simulations, the mean
decreases by 68% and a larger proportion of sites are not segregating. The two
distributionsdiffersignificantly(p-value<10-15byaKolmogorov-Smirnovtest).
−4
−5
−
−
−
−
−
0
Log10(Frequency)
−3
−2
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
94%
96%
95%
88%
55%
A
B
C
D
ExAC
FigS4.EffectofvaryingtheendpopulationsizeintheSIMmodel.Tennessenet
al. [21] inferred the present effective population size of Europeans to be 512,000
individuals (column A). We considered the effect of larger population sizes (2-, 4-,
and 10-fold increase, denoted by columns B, C and D respectively), keeping other
parameters the same. The observed allele frequency distribution of 385 disease
mutationsinExACisshowninwhite.Violinplotsshowthedensitydistributionof
allelessegregatinginthesesamples,whereasboxesindicatetheproportionofsites
for which the deleterious mutation was not observed. All distributions differ
significantly from one another (i.e., all p-values are < 10-15 by a KolmogorovSmirnovtest).
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
−4
−3
−2
SIM
−5
Log10(Frequency)
36%
81%
−
98%
67%
SIM
ExAC
ExAC
ExAC
−3
−2
SIM
−4
−5
70%
Log10(Frequency)
98%
−
−6
−6
−
0
−5
−4
−
−
0
−3
−2
−1
nonCpGtv (n=59)
p−value<0.01
−1
nonCpGti (n=83)
p−value=0.01
Log10(Frequency)
−
−6
−
0
−3
−4
−5
−
−6
0
Log10(Frequency)
−2
−1
CpGtv (n=9)
p−value=0.02
−1
CpGti (n=80)
p−value=0.30
99%
SIM
63%
ExAC
Fig S5. Expected and observed distributions of disease allele frequencies
(excluding mutations in CFTR and DHCR7). The four panels correspond to four
differentmutationtypes.Thetitleofthepanelindicatesthemutationtype,followed
bythetotalnumberofmutations,withp-valueforthedifferencebetweenobserved
andexpectedmeanfrequenciesbelowit.Resultsingrey(SIM)wereobtainedfrom
simulations considering a plausible demographic model for European populations
[21](seetext).Theobservedvaluesestimatedfrom33,370individualsofEuropean
ancestry from ExAC are shown in white. As opposed to Fig 1, we did not include
mutationspresentintwogenes(CFTRandDHCR7)thatwereoutliersinthegenelevelanalysis(Fig3)andwerereportedelsewhere[20]tohavehealthyindividuals
found to be homozygous for a deleterious allele. As in Fig 1, violin plots show the
densitydistributionofvariablesites,whileboxesindicatetheproportionofsitesfor
which the wild, non-deleterious mutation is fixed. Means (considering both
segregatingandfixedmutations)areindicatedwithredhorizontalbars.
bioRxiv preprint first posted online Dec. 4, 2016; doi: http://dx.doi.org/10.1101/091579. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
70
50
30
Coverage
90
A
B
Fig S6. Depth of coverage for 385 mutations in ExAC known to cause lethal,
Mendeliandiseases.Boxplotsshowthemean(blackbar)andthelowerandupper
quartilesfor(A)the248siteswithnon-zerosamplefrequenciesinExAC,forwhich
the number of sequenced non-Finnish European individuals was reported (N =
32,881)and(B)forthe137sitesforwhichwedidnothavethisinformation.Since
distributions of depth of coverage are similar between datasets, we assumed that
32,881individualsweresequencedonaverageatallsites,andusedthisnumberto
subsamplesimulationstomatchthesamplesizeoftheExACdata.