bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Accurate Phasing of Pedigree Genotypes Using Whole Genome Sequence Data
Blackburn A.N.1*, Kos M.Z.1, Blackburn N.B.1,2, Peralta J.M.1,2, Stevens P.1,
Lehman D.M.2, Blondell L.1, Blangero J.1, Göring H.H.H.1
1 South Texas Diabetes and Obesity Institute, School of Medicine, University of Texas Rio Grande Valley
2 Menzies Institute for Medical Research, University of Tasmania, Hobart, TAS, Australia
3 Department of Medicine, University of Texas Health Science Center at San Antonio
* Corresponding author:
Address:
August Blackburn Ph.D.
3463 Magic Drive Suite 320
San Antonio, TX 78229
Phone: 210-585-9773
E-mail: [email protected]
Running Title: Phasing Whole Genome Sequencing Data in Pedigrees
Keywords: Phasing, Pedigrees, Sequencing, Lineage Specific Alleles, PULSAR
Abstract
Phasing, the process of predicting haplotypes from genotype data, is an important undertaking in genetics and an ongoing area of research. Phasing methods, and associated software, designed specifically for pedigrees are urgently needed. Here we present a new method for phasing genotypes from whole genome sequencing data in pedigrees: PULSAR (Phasing Using Lineage Specific Alleles / Rare variants). The method is built upon the idea that alleles that are specific to a single founding chromosome within a pedigree, which we refer to as lineage-specific alleles, are highly informative for identifying haplotypes that are identical-by-descent between individuals within a pedigree. Through extensive simulation we assess the performance of PULSAR in a variety of pedigree sizes and structures, and we explore the effects of genotyping errors and the presence of non-sequenced individuals on its performance. If the genotyping error rate is sufficiently low, PULSAR can phase >99.9% of heterozygous genotypes with a switch error rate below 1x10^-4 in pedigrees where all individuals are sequenced. We demonstrate that the method is highly accurate and consistently outperforms the long-range phasing approach used for comparison in our benchmarking. The method also holds promise for fixing genotype errors or imputing missing genotypes. The software implementation of this method is freely available.
Introduction
Haplotypes, which are combinations of alleles at different polymorphic sites occurring on the same DNA molecule, are important in the study of genetics (Tewhey et al. 2011). Haplotypes are useful for imputation of alleles at ungenotyped loci, identification of genomic regions shared identical-by-descent, genotype error identification and correction, identification of compound heterozygosity, and analysis of parent-of-origin effects, among many other topics. Haplotypes can also be used in place of alleles at individual variable sites in association testing. Presently the most popular sequencing platforms and associated software packages report genotypes for individual polymorphisms, and thus haplotypes must be inferred algorithmically. Phasing, the process of analyzing genotypes to predict haplotypes, is therefore an important undertaking in genetics and an ongoing area of research (Browning and Browning 2011).
While family studies have not been popular for association-based gene mapping on complex traits recently, this situation is changing with the emerging interest in rare variants, which are arguably more easily and more powerfully studied in family data. In any case, the increase in the numbers of study subjects in a population being sequenced will necessarily lead to the inclusion of more, and more closely, related individuals by chance. For phasing, pedigrees provide additional information compared to unrelated individuals (who are in reality distantly related). The direct observation of inheritance of alleles from one generation to the next, which is possible in families, can be used to establish phase empirically, and hence families conceptually allow for highly accurate estimation of haplotypes. This also means that phasing of family data does not necessarily have to bear the computational expense of dealing with a vast number of additional samples in the form of reference panels, and that family data can be useful in populations for which good reference panels are not available. However, pedigree data bring with them substantial and unique computational challenges. Thus, computational methods and ready-to-use software tailor-made for phasing related individuals in pedigrees using whole genome sequence data are needed.
Here we describe a novel fast algorithm, which we have named PULSAR (Phasing Using Lineage Specific Alleles / Rare variants), that phases whole genome sequencing data in families. PULSAR is built upon the idea that alleles that are specific to a single founding chromosome within a pedigree, which we refer to as lineage-specific alleles (LSAs), are highly informative for identifying haplotypes that are identical-by-descent (IBD) between individuals within the pedigree. In humans, each chromosome carries many such rare variants, which we can now interrogate with whole genome sequencing, allowing for reasonably dense haplotype maps composed of LSAs, from which we can observe inheritance of haplotypes empirically. This initial haplotype map can then be expanded to include non-LSAs, which tend to be more common variants. We demonstrate the utility of PULSAR and its software implementation on simulated data as well as real whole genome sequencing data and compare its performance to long-range phasing.
Results
Description of method
The general steps of the PULSAR algorithm are as follows: 1) identify alleles that are likely to be lineage-specific (i.e., LSAs); 2) identify haplotypes, their boundaries, and their inheritance using LSAs; 3) extend the estimated boundaries of these haplotypes based on the observation that individuals will share at least one allele at loci where they share a haplotype IBD (i.e., the idea behind long-range phasing); and 4) assign alleles to the haplotypes. Our method assumes that the physical location of variants is known, and optionally makes use of external information regarding allele frequencies. Below, we describe these steps in more detail.
Identifying putative lineage-specific alleles
The premise of the PULSAR algorithm is that when an allele is present in a single chromosome among the founders of a given pedigree, then any direct descendant of that founder also carrying the allele can be assumed to share the chromosomal segment harboring that locus IBD, which allows for empirical estimation of the inheritance of haplotypes within the pedigree. Exceptions to this logic, such as de novo mutation, are real but infrequent enough not to undermine the premise of the approach. Since not all founders will be sequenced in many pedigrees, an initial challenge is identifying those alleles that are specific to a single founding chromosome. We identify potential LSAs by analyzing the pattern of individuals within a pedigree carrying a given allele. The implementation of this approach is simple: we search for alleles for which all individuals carrying the allele share at least one founder within the known structure of the pedigree under consideration. We check that the direct lineage between each founder and each person carrying the allele also carries the allele (at least for those individuals that are sequenced) and that the allele is not homozygous in any individual. For now, the algorithm does not accommodate the presence of inbreeding loops, wherein LSAs could be homozygous in inbred individuals. Figure 1A shows an example of a family where individuals carry an allele that cannot be lineage-specific since the carriers do not share a common founder. Figure 1B shows an example of a family where individuals carry a putative lineage-specific allele. Note that at this point we are only seeking to identify putative LSAs.
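The checks described in this section can be sketched as follows. This is a minimal illustration, not the PULSAR implementation: the pedigree encoding (`id -> (father, mother)`), the dosage encoding (0, 1, 2 copies of the allele, `None` for unsequenced individuals) and all function names are assumptions made for the example.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical encoding: individual id -> (father id, mother id); founders have (None, None).
Pedigree = Dict[str, Tuple[Optional[str], Optional[str]]]

PED: Pedigree = {
    "gf": (None, None), "gm": (None, None), "sp": (None, None),
    "c1": ("gf", "gm"), "c2": ("gf", "gm"),
    "gc": ("sp", "c1"),
}

def founder_ancestors(ind: str, ped: Pedigree) -> Set[str]:
    """Founders from which `ind` descends (a founder is its own founder)."""
    father, mother = ped[ind]
    if father is None and mother is None:
        return {ind}
    found: Set[str] = set()
    for parent in (father, mother):
        if parent is not None:
            found |= founder_ancestors(parent, ped)
    return found

def lineage_carries(founder: str, ind: str, ped: Pedigree, dosage: Dict[str, Optional[int]]) -> bool:
    """True if some line of descent founder -> ind exists on which every
    sequenced individual carries the allele (unsequenced ones are not checked)."""
    d = dosage.get(ind)  # None means "not sequenced"
    if d is not None and d == 0:
        return False
    if ind == founder:
        return True
    return any(p is not None and lineage_carries(founder, p, ped, dosage) for p in ped[ind])

def is_putative_lsa(ped: Pedigree, dosage: Dict[str, Optional[int]]) -> bool:
    carriers = [i for i, d in dosage.items() if d is not None and d > 0]
    # No carriers, or any homozygous carrier, rules the allele out.
    if not carriers or any(dosage[i] == 2 for i in carriers):
        return False
    # All carriers must share at least one founder ...
    shared = set.intersection(*(founder_ancestors(i, ped) for i in carriers))
    # ... and the lineage from that founder to every carrier must carry the allele.
    return any(all(lineage_carries(f, c, ped, dosage) for c in carriers) for f in shared)
```

With unsequenced grandparents and both of their children carrying the allele, the sketch reports a putative LSA with two candidate founders, the situation described for Figure 1B; carriers with no founder in common are rejected, as in Figure 1A.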
[Figure 1, panels A and B: image not reproduced in this transcript.]
Figure 1. Inheritance patterns of lineage-specific alleles. Panel A shows an example of a pattern of individuals carrying an allele in which the allele cannot be lineage-specific since individuals carrying the allele do not share a common founder. Panel B shows an example where the allele could be lineage-specific. The individuals carrying the allele have two common founders.
Constructing haplotype boundaries and establishing haplotype inheritance using LSAs
Using the set of putative LSAs from the previous step, we then seek to establish boundaries within which haplotypes are shared IBD between a set of related individuals in a pedigree. Note that at this point the putative LSAs may include some false positives. However, for now, let us assume that we are dealing only with true LSAs, and relax this assumption later. The set of individuals carrying a true LSA will share an allele IBD at nearby loci, assuming absence of meiotic recombination between the loci, mutation, or genotyping error. Conversely, a change in pattern (of which individuals share an LSA) from one locus to the next indicates one or more recombination events within the meioses that give rise to the haplotypic lineage. Hence, by tracking which individuals share neighboring LSAs along the chromosome we can establish the boundaries of a given haplotype. While the concept is straightforward, complications arise because at any given region of the diploid genome all individuals carry two haplotypes, one maternal and one paternal. Therefore, it is necessary to track two separate haplotypes simultaneously. We implement this approach using a rules-based algorithm to identify the changes in the patterns of LSA sharing along a chromosome.
If not all individuals are sequenced in a given pedigree, which is often the case, there will likely be some false positives among the putative LSAs identified in the previous step. In other words, some of the putative LSAs are in reality not IBD but merely identical-by-state (IBS). To reduce the risk of mistaking IBS for IBD, and thus inferring wrong haplotypes, we require a predetermined number of neighboring putative LSAs to be shared in order to demarcate a new haplotype. Only a very low number of changed LSA pattern observations is required to infer, with high confidence, that there is a true recombination event, because it is highly unlikely that the same individuals will share putative LSAs over a given region of a chromosome if the genotypes are a product of chance (such as being IBS or perhaps due to genotyping error) and not truly lineage-specific. Here we set this heuristic to 5 observations of neighboring putative LSAs with the same changed pattern of individuals sharing the putative LSAs.
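The 5-observation heuristic can be illustrated with a small sketch. It follows a single haplotype lineage only, whereas the method described above tracks two haplotypes at once with a rules-based algorithm; the input format, the function name and the choice to require the observations to be consecutive are assumptions made for this example.

```python
from typing import FrozenSet, List, Tuple

def segment_by_sharing_pattern(
    observations: List[Tuple[int, FrozenSet[str]]],
    min_obs: int = 5,
) -> List[Tuple[int, int, FrozenSet[str]]]:
    """observations: (position, set of carriers) for putative LSAs, sorted by position.
    A segment is opened only after `min_obs` consecutive putative LSAs show the
    same sharing pattern; isolated deviating patterns are treated as noise.
    Returns (start, end, carriers) tuples."""
    segments = []
    current = None    # [start, end, carriers] of the accepted segment
    candidate = None  # [carriers, [positions]] of a pattern not yet accepted
    for pos, carriers in observations:
        if current is not None and carriers == current[2]:
            current[1] = pos   # the accepted pattern continues
            candidate = None   # any pending change was noise
            continue
        if candidate is not None and carriers == candidate[0]:
            candidate[1].append(pos)
        else:
            candidate = [carriers, [pos]]
        if len(candidate[1]) >= min_obs:
            # Enough support: close the old segment and open the new one.
            if current is not None:
                segments.append(tuple(current))
            current = [candidate[1][0], candidate[1][-1], candidate[0]]
            candidate = None
    if current is not None:
        segments.append(tuple(current))
    return segments
```

In this sketch a single putative LSA with a deviating pattern (for instance an IBS false positive) does not split a segment, while five neighboring LSAs with the same new pattern do.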
After identifying the haplotypes a proband carries along the chromosome, we perform two procedures to establish chromosome-length haplotypes from these smaller haplotype segments. First, during the course of identifying the haplotypes, if a change is observed in one set of individuals (one haplotype segment changes) we assume that the previous and new haplotype segments are on the same chromosome in the person where the pattern has changed. Second, for non-founders with a sufficient number of informative relatives with sequencing data we identify whether the haplotype is shared with the maternal or paternal side of the proband's relatives. Since each proband inherits one chromosome from each parent, haplotypes that are shared with either paternal or maternal relatives are assumed to be on either the paternally or maternally inherited chromosome, respectively.
Extension of haplotype boundaries using IBS allele sharing
While the human genome contains a vast number of rare variants, many of which will be lineage-specific in a given pedigree, the density of LSAs nonetheless limits the precision with which the boundaries of haplotypes can be determined. We extend the haplotype boundaries observed in a proband with a simple heuristic, namely that individuals sharing a haplotype must share an allele IBD (and thus also IBS) at each locus in the haplotype. This same heuristic is central to the rationale behind long-range phasing, and our implementation is similar in that we search for opposing homozygous genotypes. The primary difference is that we are not using this rationale to discover shared haplotypes, but only to extend predetermined haplotypes, carried by known individuals, across relatively small unresolved segments of the genome (typically <1% of the genome; we present an investigation of the density of LSA coverage below). This haplotype extension step proceeds in both directions into the unassigned gap between two neighboring haplotypes identified in the previous step, and we extend haplotypes only so far that there is no overlap.
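The opposing-homozygote test behind this extension step can be sketched as below. The data layout, the function names and the one-directional walk are assumptions for illustration; the rule that two neighboring haplotypes must not overlap is not modeled here.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Genotypes = Dict[str, Optional[int]]  # individual -> allele dosage 0/1/2, None if uncalled

def shares_allele_ibs(genos: Genotypes, sharers: Sequence[str]) -> bool:
    """Individuals sharing a haplotype must share at least one allele at the locus,
    so opposing homozygotes (dosage 0 and dosage 2) among them are incompatible."""
    called = [genos[i] for i in sharers if genos.get(i) is not None]
    return not (0 in called and 2 in called)

def extend_boundary(
    loci_outward: List[Tuple[int, Genotypes]],
    sharers: Sequence[str],
    start: int,
) -> int:
    """Walk away from a known haplotype boundary (`start`) through the loci of the
    unresolved gap, listed in the order they are encountered, and stop at the first
    locus that is incompatible with the sharers carrying one haplotype IBD."""
    boundary = start
    for pos, genos in loci_outward:
        if not shares_allele_ibs(genos, sharers):
            break
        boundary = pos
    return boundary
```

To extend in the other direction under the same assumptions, the gap loci would be passed in decreasing order of position.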
Mapping alleles to haplotypes
After establishing the haplotype boundaries and inheritance throughout the pedigree, we then map alleles for all variants onto these haplotypes. This step could be integrated into the prior steps, but in our implementation we assign alleles to haplotypes (including LSAs) as a separate, final step of the procedure. Homozygous genotypes are straightforward to map onto haplotypes, with each of the two haplotypes carrying the same allele; this is done first. Considering each variant independently, we then phase heterozygous genotypes in individuals for which at least one allele has already been mapped onto one of the two haplotypes they carry at that genomic location. We repeat this mapping process iteratively, now incorporating the mapped alleles from heterozygous genotypes from the previous iteration, until no additional genotypes can be phased. When all individuals within a pedigree are sequenced, and all of the haplotypes each individual carries are known, barring genotyping errors this procedure can resolve the phase of all combinations of genotypes with the exception of the case where every individual is heterozygous, a scenario that becomes less likely in larger pedigrees.
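For a single biallelic variant, the iterative mapping described here might look like the following sketch. The haplotype identifiers, the dosage encoding and the function name are assumptions for illustration, and conflicting evidence (genotyping errors) is deliberately not handled in this version.

```python
from typing import Dict, Tuple

def map_alleles_to_haplotypes(
    carried: Dict[str, Tuple[str, str]],  # individual -> ids of the two haplotypes carried here
    dosage: Dict[str, int],               # individual -> copies of the alternate allele (0/1/2)
) -> Dict[str, int]:
    """Return haplotype id -> allele (0 or 1) for every haplotype that can be resolved."""
    allele: Dict[str, int] = {}
    # Step 1: homozygous genotypes place the same allele on both haplotypes.
    for ind, d in dosage.items():
        if d in (0, 2):
            for hap in carried[ind]:
                allele.setdefault(hap, d // 2)
    # Step 2: repeatedly phase heterozygotes for which one haplotype is already resolved.
    changed = True
    while changed:
        changed = False
        for ind, d in dosage.items():
            if d != 1:
                continue
            a, b = carried[ind]
            if a in allele and b not in allele:
                allele[b] = 1 - allele[a]
                changed = True
            elif b in allele and a not in allele:
                allele[a] = 1 - allele[b]
                changed = True
    return allele
```

In a trio where the mother is homozygous and the father and child are heterozygous, the child is phased through the mother's haplotype and the father is then phased through the child. When every individual is heterozygous, nothing can be resolved and the result is empty.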
PULSAR considers evidence from multiple individuals (if multiple people carry the haplotype) when assigning alleles to haplotypes. In the presence of genotyping errors it becomes possible for the genotypes of multiple individuals to provide conflicting support for which allele is carried on a given haplotype. For example, two individuals may share a haplotype but have opposing homozygous genotypes. In these ambiguous cases, when more than one allele is supported by the inferred haplotype sharing and individual genotypes, the allele supported by the majority of individuals is assigned to that haplotype. When the correct alleles are assigned to both haplotypes carried by an individual, reconstructed genotypes from these haplotypes can be used to correct genotyping errors. However, in some cases there is no majority of individuals with which PULSAR can correct genotyping errors. For example, in trios no haplotype is shared between more than two people, and thus no majority can be formed in order to correct genotyping errors. The same principle holds for haplotypes that are shared between two or fewer individuals within larger pedigrees, such as haplotypes shared between married-in founders and only one offspring when the lineage is not passed on to subsequent generations.
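The majority rule can be sketched as a simple vote. The helper names and the way each individual's evidence is derived are assumptions for illustration; the source does not specify how ties are handled, so this sketch returns `None` when no strict majority exists.

```python
from collections import Counter
from typing import List, Optional

def implied_allele(dosage: int, other_haplotype_allele: Optional[int] = None) -> Optional[int]:
    """Allele (0/1) that one individual's genotype implies for a given haplotype.
    Homozygotes are unambiguous; a heterozygote is informative only when the allele
    on the individual's other haplotype is already known."""
    if dosage == 0:
        return 0
    if dosage == 2:
        return 1
    if other_haplotype_allele is not None:
        return 1 - other_haplotype_allele
    return None

def majority_allele(votes: List[int]) -> Optional[int]:
    """Allele supported by a strict majority of the individuals sharing the haplotype,
    or None when there is a tie or no evidence."""
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

Two individuals with opposing homozygous genotypes produce a tie, which corresponds to the trio case above in which no majority can be formed; a third sharer breaks the tie.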
Benchmarking
Coverage and allele-frequency distribution of lineage-specific alleles
As described above, the PULSAR algorithm is based on the idea that variants that are introduced into a pedigree via a single founding chromosome, i.e., LSAs, can be used as convenient tags to trace the inheritance of haplotypes within pedigrees. For this approach to be practical, it is crucial that LSAs exist in sufficient density in realistic pedigree structures. Thus, we have estimated the degree of LSA coverage using real sequencing data with known phase, namely the X chromosomes from the British (GBR) and Finnish (FIN) cohorts from the 1000 Genomes Project (Genomes Project et al. 2015). If we view pedigree founders as a random population sample, then various key aspects of LSA coverage can easily be determined by permutation for different pedigree structures. Supplemental Figure 1 shows the distribution of the number of LSAs per Mb along the X chromosome as a function of the number of pedigree founders for the GBR and FIN cohorts. In the British cohort, the density ranges from a median of 135.4 LSAs per Mb in 2-founder pedigrees (median inter-LSA distance of 874 bp, with a maximum of 5.14 Mb, which includes a gap in coverage caused by the centromere) to 14.8 LSAs per Mb with 15 founders (median distance of 19.0 kb, with a maximum of 5.39 Mb, which includes a gap in coverage caused by the centromere). We estimate that a pedigree with 170 founders would have ~1.0 LSAs per Mb. In Finns, a population with lower genetic diversity (primarily due to a small founding population), there are comparatively fewer LSAs, with a median of 133.1 LSAs per Mb in 2-founder pedigrees and 13.4 LSAs per Mb with 15 founders. The trend toward fewer LSAs in larger pedigrees makes sense since our definition of LSA is based on a single founding occurrence per pedigree, and thus fewer polymorphic sites qualify the more founders a pedigree contains.
With regard to the allele frequency distribution of LSAs, since the probability of a single founding event in a pedigree depends on the population prevalence of an allele, rare alleles are more likely to be specific to a single founding chromosome, and the enrichment for rare alleles among all LSAs will be greater in pedigrees with more founders. This is what we observe. In the British cohort, with 2 founders ~13% of LSAs have minor allele frequencies <5% (median allele frequency is 21.7%), whereas with 15 founders ~91% of LSAs have minor allele frequencies <5% (median allele frequency is 2.2%). These observations indicate that minor allele frequency (MAF) estimates, either from the dataset in hand or from a reference panel, are expected to be a useful filter for decreasing the false positive rate when identifying LSAs is ambiguous (perhaps due to non-sequenced individuals). In any case, the important take-home message with regard to our phasing method is that in humans there appear to be sufficiently many LSAs to make them useful as a starting point for phasing of whole genome sequencing (WGS) data in pedigrees.
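Both points in this section can be illustrated with a short sketch under the stated view of founders as a random population sample. The function names and the 0/1 haplotype encoding are assumptions, and "lineage-specific" is simplified here to "the alternate allele occurs on exactly one sampled founder chromosome". The binomial expression n p (1-p)^(n-1) for an allele of frequency p appearing exactly once among n independently sampled chromosomes is standard, but it is added for illustration rather than taken from the source.

```python
import random
from statistics import median
from typing import List, Sequence

def count_lsas(founder_haplotypes: Sequence[Sequence[int]]) -> int:
    """Number of sites whose alternate allele (1) is present on exactly one
    of the sampled founder chromosomes."""
    n_sites = len(founder_haplotypes[0])
    return sum(1 for s in range(n_sites) if sum(h[s] for h in founder_haplotypes) == 1)

def median_lsa_density(panel: List[Sequence[int]], n_chrom: int, region_mb: float,
                       n_perm: int = 100, seed: int = 0) -> float:
    """Median LSAs per Mb over repeated random draws of `n_chrom` founder
    chromosomes from a phased panel covering a region of `region_mb` megabases."""
    rng = random.Random(seed)
    return median(count_lsas(rng.sample(panel, n_chrom)) / region_mb for _ in range(n_perm))

def single_copy_probability(freq: float, n_chrom: int) -> float:
    """Probability that an allele with population frequency `freq` appears exactly
    once among `n_chrom` independently sampled chromosomes."""
    return n_chrom * freq * (1.0 - freq) ** (n_chrom - 1)
```

With many founder chromosomes, `single_copy_probability` is far larger for a rare allele than for a common one, which is consistent with the enrichment for rare alleles among LSAs in pedigrees with more founders.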
Pedigrees with complete sequencing
We measured the performance of PULSAR and AlphaPhase 1.1 in pedigrees varying in size and structure under the ideal scenario in which all individuals in the dataset are sequenced without genotyping errors. Table 1 presents a comparison of the switch error rate (SER) and the percentage of heterozygous markers phased for these simulations. Across all simulations, PULSAR produced lower SERs and a higher percentage of heterozygous markers phased than AlphaPhase 1.1. In the case of nuclear pedigrees some markers were heterozygous in all individuals in a pedigree, a situation that is unresolvable without external data. Excluding these unresolvable cases, the observed number of phased heterozygous genotypes was >99.9% of the theoretical upper limit using PULSAR.
Effect of genotyping errors
We sought to assess the robustness of PULSAR to genotyping errors. Table 2 presents a comparison of the switch error rate and the percentage of heterozygous markers phased for simulations wherein the genotyping accuracy is 99.0% (assuming, for simplicity, an equal error rate for all variants). Again, PULSAR produced lower switch error rates and a higher percentage of heterozygous markers phased than AlphaPhase 1.1.
Table 1. Complete and Accurate Data
SER = switch error rate. Phased = percentage of heterozygous genotypes phased, given as a proportion.
Pedigree size  Founders  Generations  SER PULSAR   SER AlphaPhase 1.1  Phased PULSAR  Phased AlphaPhase 1.1
3              2         2            0            4.17x10^-2          0.8164         0.7425
4              2         2            0            8.46x10^-2          0.8499         0.8193
5              2         2            1.04x10^-5   9.20x10^-2          0.9420         0.8994
6              2         2            1.28x10^-5   8.44x10^-2          0.9887         0.9312
7              2         2            1.79x10^-5   8.93x10^-2          0.9998         0.9398
8              2         2            1.56x10^-5   9.28x10^-2          0.9999         0.9514
9              2         2            1.68x10^-5   7.84x10^-2          0.9999         0.9464
11             5         3            1.90x10^-5   5.52x10^-2          0.9996         0.9794
14             6         4            4.85x10^-4   5.46x10^-2          0.9979         0.9815
20             6         4            1.07x10^-5   4.10x10^-2          0.9966         0.9846
28             8         4            3.43x10^-4   4.73x10^-2          0.9989         0.9849
55             17        6            1.15x10^-4   5.50x10^-2          0.9989         0.9747
78             21        6            7.15x10^-4   5.53x10^-2          0.9973         0.9744
94             28        5            2.64x10^-4   6.26x10^-2          0.9966         0.9696
Table 2. Genotyping Errors
SER = switch error rate. Phased = percentage of heterozygous genotypes phased, given as a proportion.
Pedigree size  Founders  Generations  SER PULSAR   SER AlphaPhase 1.1  Phased PULSAR  Phased AlphaPhase 1.1
9              2         2            1.74x10^-3   7.22x10^-2          0.9997         0.8960
11             5         3            8.33x10^-3   6.95x10^-2          0.9961         0.8800
14             6         4            8.44x10^-3   6.22x10^-2          0.9837         0.9093
20             6         4            6.99x10^-3   5.33x10^-2          0.9830         0.9316
28             8         4            5.95x10^-3   6.36x10^-2          0.9912         0.9242
55             17        6            7.51x10^-3   6.71x10^-2          0.9629         0.9149
78             21        6            9.28x10^-3   6.73x10^-2          0.9437         0.9228
94             28        5            8.41x10^-3   7.60x10^-2          0.9026         0.9091
For the case of nuclear pedigrees, we created multiple simulated genotyping datasets with a range of genotyping accuracies (from 99.0% to 100.0%). Genotyping errors were simulated for 20 datasets at each genotyping accuracy level for each pedigree. As shown in Figure 2, as the genotyping error rate increases, the switch error rate increases linearly in the simulated pedigrees. Genotyping errors had an attenuated negative effect on switch error rate in the pedigrees with more children. The reason is that in a larger pedigree, more individuals potentially share genomic sections IBD with one another, enabling PULSAR to identify haplotypes more reliably in the presence of genotyping errors.
[Figure 2: plot of switch error rate (y-axis, 0.000 to 0.025) against simulated genotype error rate (x-axis, 0.000 to 0.010), with one series for each nuclear pedigree of 1 to 7 children.]
Figure 2. Effect of genotyping accuracy and IBD sharing on switch error rate. The plot shows the switch error rates (with estimated error bar) from 20 simulations at varying genotyping accuracies in nuclear pedigrees with a range of 1-7 children, in which each pedigree with fewer children is a subset of the pedigrees with more children. Genotyping errors increase the switch error rate in a predictable manner. Increased IBD sharing within the pedigree attenuates this effect.
Based on the haplotypes output by PULSAR we reconstructed genotypes and compared them to the true genotypes (i.e., without introduced genotyping errors). As shown in Figure 3, the accuracy of the reconstructed genotypes decreases linearly with the simulated error rate across pedigrees. The reconstructed genotype accuracies from the trio pedigree are very similar to the accuracy of the simulated genotypes, which is expected because the pedigree structure lacks a majority of haplotypes with which to correct genotyping errors (as outlined previously). In fact, the reconstructed genotype error rate is slightly higher than the true genotype error rate, because the algorithm introduces some (though very few) errors when haplotype boundary estimation is imperfect. However, PULSAR is able to correct genotypes in pedigrees with 2 or more children, performing better in pedigrees with more children. Together these observations also demonstrate that increased IBD sharing between pedigree members improves the performance of the PULSAR algorithm.
[Figure 3: plot of reconstructed genotype error rate (y-axis, 0.000 to 0.010) against simulated genotype error rate (x-axis, 0.000 to 0.010), with one series for each nuclear pedigree of 1 to 7 children.]
Figure 3. Effect of IBD sharing on correcting genotypes through imputation. The plot shows the accuracies (and associated estimated error) of genotypes reconstructed from haplotypes produced by PULSAR for 20 simulations at varying genotyping accuracies in nuclear pedigrees with a range of 1-7 children, in which each pedigree with fewer children is a subset of the pedigrees with more children. The diagonal is present to show the equivalence between the axes. Increased IBD sharing improves the ability of the PULSAR algorithm to correct genotypes through imputation.
Effect of ungenotyped individuals
Ungenotyped individuals in a pedigree are another frequent complication in real datasets. PULSAR will not phase heterozygous genotypes for individuals in genomic regions where the individual does not share a chromosomal segment IBD with another sequenced individual, unless done erroneously after falsely inferring IBD sharing. Non-sequenced individuals can affect the false positive rate among putative LSAs identified by PULSAR, which will ultimately decrease the accuracy of haplotype boundary and sharing estimation.
Table 3 presents a comparison of the switch error rate and the percentage of heterozygous markers phased for simulations wherein some individuals are missing sequencing data. For the San Antonio Mexican American Family Study (SAMAFS) pedigrees, individuals with real sequencing data were given sequencing data in the simulation. We performed two simulations involving the nuclear family, wherein one and both parents were not sequenced. The comparative estimates for the scenario of complete genotyping of every pedigree member are already shown in Table 1. The number of ungenotyped individuals and, more importantly, the location of the ungenotyped individuals within the pedigree have notable effects on the haplotype phasing results. In the case of the nuclear families, children missing sequencing data have only a small negative effect on the accuracy of PULSAR, as this situation does not lead to an increase in the number of falsely inferred LSAs, though the number of observations of each haplotype is reduced on average. Missing sequencing data for children in a nuclear family is essentially equivalent to reducing the size of the nuclear family by the number of unsequenced children. A consequence is that a higher proportion of variants will be heterozygous in all sequenced individuals (this is only of practical importance if the number of sequenced individuals in a pedigree is small). For that reason, the main impact of missing children in nuclear families is a lowered percentage of loci that are phased, because PULSAR does not resolve cases wherein all individuals are heterozygous (see the estimates from Table 1 for nuclear families of different sizes). The effect of missing parents is much greater because this scenario leads to ambiguity in identifying LSAs. In the simulated pedigree with one missing parent the false positive rate among putative LSAs was 15.9% compared to 0% when both parents are sequenced, and the SER increased from 1.7x10^-5 to 4.0x10^-2. The total number of heterozygous markers phased dropped from 99.99% to 99.45%.
Mitigating the negative effect of ungenotyped individuals
We sought to mitigate the negative effect of ungenotyped individuals on the performance of PULSAR. We investigated the extent to which filtering LSAs based on minor allele frequency would help. Table 3 also presents a comparison of the effects of missing individuals in nuclear families wherein we have pre-filtered putative LSAs based on a minor allele frequency of <5%. In the nuclear pedigree with 7 children with one non-sequenced parent, the rate of falsely inferred LSAs was 15.9% prior to filtering and 1.50% afterward. The SER dropped from 4.0x10^-2 to 1.5x10^-3. Pre-filtering is not without cost, however. The total number of putative LSAs dropped from 61,160 to 7,850. The total percentage of heterozygous markers phased dropped from 99.45% to 98.58%. In the nuclear family, considering the tradeoffs between the accuracy of the phasing results (measured using SER) and the percentage of heterozygous markers phased, filtering based on MAF performed better, improving the overall number of heterozygous markers phased correctly, which can be measured by the product of the accuracy (1-SER) and the percentage of heterozygous markers phased.
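As a worked example of this product, the sketch below uses the values reported in Table 3 for the nuclear pedigree with one non-sequenced parent (SER 4.04x10^-2 with 0.9945 phased without pre-filtering; SER 1.46x10^-3 with 0.9858 phased with pre-filtering). The function name is an assumption for illustration.

```python
def fraction_correctly_phased(ser: float, fraction_phased: float) -> float:
    """Product of phasing accuracy (1 - SER) and the fraction of heterozygous markers phased."""
    return (1.0 - ser) * fraction_phased

# Values from Table 3, nuclear pedigree of 9 with 8 individuals sequenced.
without_filter = fraction_correctly_phased(4.04e-2, 0.9945)
with_filter = fraction_correctly_phased(1.46e-3, 0.9858)

print(f"without pre-filtering: {without_filter:.4f}")
print(f"with pre-filtering:    {with_filter:.4f}")
```

The product is roughly 0.954 without pre-filtering and roughly 0.984 with it, so in this pedigree the gain in accuracy outweighs the loss in completeness.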
In the largest pedigree in this study, wherein 61 of 94 (64.9%) individuals have sequencing data, filtering using MAF <5% had only a minor effect on the results. The proportion of false positives among putative LSAs dropped from 0.0099 to 0.0091, the total number of putative LSAs dropped from 44,889 to 40,060, the SER went up from 2.54x10^-3 to 2.98x10^-3, and the percentage of heterozygous variants phased dropped from 97.4% to 97.3%. In pedigrees with many founders, the putative LSAs that PULSAR identifies already have low allele frequencies for the most part, and hence filtering at a 5% prevalence threshold has little effect. It is possible that an even lower MAF filter would be advantageous, though it is clear that MAF filtering is more advantageous in smaller pedigrees with fewer founders.
Non-sequenced individuals have a negative effect on the performance of the PULSAR algorithm, but not in all cases. Since it is not always straightforward to estimate which cases of missing individuals are detrimental to the effectiveness of the PULSAR algorithm, we have created a tool that uses a Monte-Carlo gene-dropping approach to estimate the expected percentage of the genome over which each individual will share at least one haplotype IBD with another sequenced individual in the pedigree. This tool can be used for a priori estimation of the applicability of the PULSAR algorithm to a pedigree dataset given the sequenced individuals within the pedigree.
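A gene-dropping estimate of this kind could be sketched as follows. This is not the authors' tool: it drops alleles at a single unlinked locus per iteration, assumes every non-founder has both parents in the pedigree, and uses hypothetical names throughout. Because an expectation over the genome does not depend on linkage, the per-locus sharing probability serves here as an estimate of the expected fraction of the genome shared.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

Pedigree = Dict[str, Tuple[Optional[str], Optional[str]]]  # id -> (father, mother)

def _parents_first(ped: Pedigree) -> List[str]:
    """Order individuals so that parents always precede their children."""
    order, seen = [], set()
    def visit(ind: str) -> None:
        if ind in seen:
            return
        for parent in ped[ind]:
            if parent is not None:
                visit(parent)
        seen.add(ind)
        order.append(ind)
    for ind in ped:
        visit(ind)
    return order

def ibd_sharing_fraction(ped: Pedigree, sequenced: Sequence[str],
                         n_iter: int = 2000, seed: int = 1) -> Dict[str, float]:
    """For each sequenced individual, estimate the probability of sharing at least
    one allele IBD with another sequenced individual at a random locus."""
    rng = random.Random(seed)
    order = _parents_first(ped)
    hits = {i: 0 for i in sequenced}
    for _ in range(n_iter):
        alleles = {}
        for ind in order:
            father, mother = ped[ind]
            if father is None and mother is None:
                # Each founder contributes two distinct founder alleles.
                alleles[ind] = ((ind, 0), (ind, 1))
            else:
                # Mendelian transmission: one random allele from each parent.
                alleles[ind] = (rng.choice(alleles[father]), rng.choice(alleles[mother]))
        for i in sequenced:
            mine = set(alleles[i])
            if any(mine & set(alleles[j]) for j in sequenced if j != i):
                hits[i] += 1
    return {i: hits[i] / n_iter for i in sequenced}
```

For two sequenced siblings with unsequenced parents the estimate is close to 0.75, the known probability that full siblings share at least one allele IBD at a locus, whereas two unrelated founders sequenced on their own give 0.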
Table 3. Missing Individuals
SER = switch error rate. Phased = percentage of heterozygous genotypes phased, given as a proportion. Numbers of sequenced individuals are in parentheses.
Pedigree size  Founders     Generations  SER PULSAR   SER PULSAR        SER             Phased   Phased PULSAR     Phased
(sequenced)    (sequenced)                            w/ prefiltering   AlphaPhase 1.1  PULSAR   w/ prefiltering   AlphaPhase 1.1
9 (7)          2 (0)        2            1.04x10^-2   2.44x10^-4        8.75x10^-2      0.8021   0.8048            0.7677
9 (8)          2 (1)        2            4.04x10^-2   1.46x10^-3        1.41x10^-2      0.9945   0.9858            0.9443
11 (4)         5 (1)        3            3.58x10^-2   1.52x10^-3        8.22x10^-2      0.6915   0.5473            0.6822
14 (3)         6 (0)        4            5.30x10^-5   5.30x10^-5        5.34x10^-2      0.3266   0.3256            0.4128
20 (3)         6 (0)        4            5.95x10^-2   5.48x10^-2        4.54x10^-2      0.6601   0.6587            0.0040
28 (16)        8 (2)        4            1.11x10^-2   3.08x10^-3        6.35x10^-2      0.9944   0.9988            0.9730
55 (29)        17 (3)       6            2.00x10^-2   1.44x10^-2        7.50x10^-2      0.9788   0.9609            0.8901
78 (31)        21 (5)       6            8.17x10^-3   5.76x10^-3        7.21x10^-2      0.9945   0.9873            0.8989
94 (61)        28 (9)       5            2.54x10^-3   2.98x10^-3        6.57x10^-2      0.9738   0.9726            0.9280
Performance on data with sequencing errors and missing individuals
We then sought to benchmark our approach on realistic data, having both missing individuals and genotyping errors. To do this we only examined genotypes for individuals with whole genome sequencing data available in the SAMAFS. The nuclear pedigree was set up so that genotypes for one parent were missing. We simulated genotypes with a genotype accuracy of 99.0%. Most sequencing platforms outperform this benchmark easily with appropriate read depth, and thus 99.0% accuracy is conservative. Results for these simulations are presented in Table 4. With few exceptions, PULSAR outperformed AlphaPhase 1.1 both in accuracy and in completeness.
Table 4. Realistic Data
SER = switch error rate. Phased = percentage of heterozygous genotypes phased, given as a proportion. Numbers of sequenced individuals are in parentheses.
Pedigree size  Founders     Generations  SER PULSAR         SER             Phased PULSAR      Phased
(sequenced)    (sequenced)               w/ pre-filtering   AlphaPhase 1.1  w/ pre-filtering   AlphaPhase 1.1
9 (8)          2 (1)        2            1.85x10^-2         1.32x10^-1      0.9370             0.8879
11 (4)         5 (1)        3            1.85x10^-2         6.24x10^-2      0.5207             0.5331
14 (3)         6 (0)        4            1.96x10^-2         5.71x10^-2      0.3327             0.3166
20 (3)         6 (0)        4            7.25x10^-2         *               0.6569             *
28 (16)        8 (2)        4            7.71x10^-2         7.50x10^-2      0.9617             0.9026
55 (29)        17 (3)       6            3.83x10^-2         8.38x10^-2      0.8734             0.8155
78 (31)        21 (5)       6            1.93x10^-2         8.22x10^-2      0.9274             0.8295
94 (59)        28 (9)       5            1.45x10^-2         7.83x10^-2      0.8908             0.8393
* Not phased successfully
Application to real whole genome sequencing data
Lastly, we sought to investigate the performance of PULSAR when applied to real
whole genome sequencing data. To do this we used whole genome sequencing genotypes
for chromosome 21 from the San Antonio Mexican American Family Study, focusing on the
same multigenerational pedigrees that we have used as template pedigree structures in our
simulations. This dataset contains all problems and errors inherent with real-world data,
including ungenotyped individuals, some missing (uncalled) genotypes within sequenced
individuals, and genotyping errors. (Gross errors in pedigree relationships had
previously been detected and corrected, but some low level of kinship between
presumed unrelated founders remains possible.) The downside of real data is that the true
state, here mainly phasing, is unknown. To permit a more complete characterization of the
performance of the PULSAR algorithm on the real data, we created artificial blanks by
masking genotypes randomly, each with a probability of 1/1000, permitting us to compare
the original genotype calls and the imputed genotypes after phasing. We then applied
PULSAR to the manipulated WGS genotypes. We reconstructed genotypes from the phased
haplotypes produced by PULSAR, and then compared them to the original genotypes. Table
5 summarizes the results for this phasing and genotype imputation experiment. PULSAR
was able to phase the vast majority of observed genotypes (>97% in all pedigrees, except in
one of the sparsely genotyped pedigrees). The genotypes reconstituted from phased
haplotypes matched the observed, unblanked genotypes well (>98.5% concordance). For
the masked genotypes, the concordance rate between the imputed genotypes and the
masked genotypes ranged from 95.5% to 98.6% across pedigrees, indicating a high degree
of accuracy in phasing and genotype imputation. However, only a fairly small percentage of
the masked genotypes were imputed (<50%). There were also some actual missing
genotypes in the original sequencing data. There is a notable difference between the
percentages of masked genotypes and truly missing genotypes PULSAR was able to impute.
This difference is likely due to the nature of how the missing genotypes are distributed. The
masked genotypes were distributed randomly, whereas the truly missing genotypes were
more likely to be found at the same marker in multiple individuals. This could be due to any
number of reasons, with one reasonable hypothesis being that indels or copy number
variants disrupt the diploid state of these markers in a subset of related individuals. In
summary, PULSAR phased the vast majority of the observed genotypes accurately.
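The masking experiment described above can be sketched in a few lines. This is a minimal illustration rather than the code used for the analysis: genotypes are assumed to be coded as alternate-allele counts with `None` for a missing call, and the helper names `mask_genotypes` and `imputation_summary` are hypothetical.

```python
import random


def mask_genotypes(genotypes, p=1 / 1000, rng=None):
    """Blank each observed genotype independently with probability `p`.

    Returns the masked copy and the indices that were blanked, so that the
    original calls can later be compared with the imputed ones.
    """
    rng = rng or random.Random()
    masked = list(genotypes)
    masked_idx = []
    for i, g in enumerate(genotypes):
        if g is not None and rng.random() < p:
            masked[i] = None
            masked_idx.append(i)
    return masked, masked_idx


def imputation_summary(original, reconstructed, indices):
    """Return (fraction of `indices` that were imputed, concordance among them).

    `reconstructed` holds genotypes rebuilt from phased haplotypes, with None
    where no genotype could be reconstructed. Concordance is None when
    nothing was imputed.
    """
    if not indices:
        return 0.0, None
    imputed = [i for i in indices if reconstructed[i] is not None]
    fraction = len(imputed) / len(indices)
    if not imputed:
        return fraction, None
    agree = sum(original[i] == reconstructed[i] for i in imputed)
    return fraction, agree / len(imputed)
```

Applying `imputation_summary` to the masked indices corresponds to the masked-genotype columns of Table 5, and applying it to the non-masked indices corresponds to the non-masked columns.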
Table 5. Real Data
Pedigree size  Founders     Generations  Concordance rate     Percentage of     Concordance rate     Percentage of     Percentage of
(sequenced)    (sequenced)               between genotypes    non-masked        between genotypes    masked genotypes  missing
                                         reconstructed from   genotypes phased  reconstructed from   phased/imputed    genotypes
                                         phasing results and                    phasing results and                    imputed
                                         non-masked                             masked genotypes
                                         genotypes                              (i.e. imputed
                                                                                genotypes)
11(4)          5(1)         3            0.9948               0.9769            0.9813               0.1853            0.0714
14(3)          6(0)         4            1                    0.9297            *                    0                 0
20(3)          6(0)         4            0.9859               0.9950            0.9545               0.4221            0.2025
28(16)         8(2)         4            0.9899               0.9937            0.9798               0.4385            0.3396
55(29)         17(3)        6            0.9873               0.9885            0.9689               0.2450            0.2074
78(31)         21(5)        6            0.9943               0.9909            0.9858               0.3116            0.2603
94(61)         28(9)        5            0.9941               0.9858            0.9823               0.2866            0.1437
* No genotypes were imputed
Computation time
We sought to estimate the scalability of our software to larger pedigrees. We
simulated pedigrees with two parents and a range of children between 1 and 1000. This
design allows us to investigate the effect of increasing the IBD sharing within the pedigree.
It appears that the computation time can be modeled well using a second order polynomial
based on the number of individuals in the pedigree. We estimate, based on the fitted
polynomial function, that a pedigree with 2 parents and 1000 children would take roughly
5.5 days to run on a MacBook Pro laptop with a 2.9 GHz Intel Core i7 processor. We also
simulated pedigrees in which each generation consisted of one offspring of the previous
generation and one married-in founder. This pedigree design was chosen because the IBD
sharing within the pedigree does not increase over generations. We simulated pedigrees in
the range of 2-50 generations in this manner. A linear model appears to be the best fit to
describe the relationship between the number of individuals in the pedigree and the time
PULSAR took to analyze the genotypes created in each simulation. We estimate that a
pedigree with 500 generations and 999 individuals would take roughly 88 minutes to run.
Thus, taken together, these results show that both pedigree size and pedigree structure
affect the run time of the PULSAR algorithm, with run time scaling linearly with pedigree
size when IBD sharing within the pedigree does not increase.
In the runs on the simulated data in the largest pedigree (94 individuals),
AlphaPhase (93 seconds) was faster than PULSAR (938 seconds) on a server with 2.40 GHz
Intel Xeon Core i7 CPUs running CentOS Linux 7. Although computation times become
important with increasing volumes of data, we found PULSAR to be sufficiently fast for
practical use on whole genome sequencing data in pedigrees of the sizes presented here.
Discussion
Obtaining accurate phasing information is an essential step in many genetic studies.
Nonetheless, there are currently few tools designed specifically to phase genotypes in a
broad range of pedigree sizes and structures, with most available methods working only in
unrelated individuals or in nuclear families. In the benchmarking studies we have
undertaken, PULSAR outperformed long-range phasing (using the software AlphaPhase1.1),
produced low switch error rates, and phased a high percentage of heterozygous variants.
While the method clearly has merit, it has some shortcomings and limitations. In our
simulations we have explored the impact of non-sequenced individuals and genotype
errors, both of which are unavoidable in real datasets and weaken the performance of
PULSAR, as well as that of alternative approaches for phasing. In our simulations of realistic
data, we examined some pedigrees with very sparse sequencing, such as 3 out of 20
individuals being sequenced. Even in these problematic pedigrees, with many missing
individuals, PULSAR performed fairly well, erring on the side of phasing fewer heterozygous
markers rather than phasing them incorrectly. The percentage of alleles individuals share
IBD with at least one other sequenced individual is the critical factor in predicting the
performance of PULSAR. If the individuals sequenced within the pedigree are not predicted
to share a high percentage of alleles IBD, one should consider alternative phasing
approaches, such as those designed for singletons.
Our simulations, using realistic genotypes and chromosomal haplotypes, show that
PULSAR is fairly robust to genotyping errors. In fact, PULSAR can be used to correct some
genotyping errors. PULSAR “fixes” incorrect genotypes using a rudimentary method, a
simple majority vote among individuals sharing a haplotype. A weighting scheme based on
the confidence of genotype calls or based on the number of reads supporting a given allele
call may improve PULSAR’s ability to fix genotype errors.
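The majority vote mentioned above can be illustrated with a short sketch. It is not taken from the PULSAR source code: the function name `majority_allele`, the use of `None` for an unknown allele, and the decision to return `None` on a tie are assumptions made for this example.

```python
from collections import Counter


def majority_allele(alleles):
    """Return the allele carried by most individuals on one shared haplotype
    at a single site, or None if there is no clear majority.

    `alleles` lists the allele each individual appears to carry on the shared
    haplotype (for example 0 = reference, 1 = alternate), with None where the
    allele is unknown. Treating ties as undecided is an assumption of this
    sketch.
    """
    counts = Counter(a for a in alleles if a is not None)
    if not counts:
        return None
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: leave the site uncorrected
    return ranked[0][0]
```

A weighted variant, as suggested in the text, could replace the unit counts with per-call confidence scores or read depths.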
We have not investigated the impact of errors in the pedigree structures themselves
on PULSAR (including the existence of unknown relationships between pedigree founders).
We advocate that pedigree relationships be confirmed or estimated analytically (Sun et al.
2002; Sun and Dimitromanolakis 2014) before one embarks on efforts to establish phase,
correct genotyping errors, or impute missing genotypes.
PULSAR was designed specifically to work on whole genome sequence data, based
on the rationale that the vast number of rare sequence variants in the human genome would
produce a high density of LSAs in pedigrees. While whole genome sequencing clearly is the
preferred way to characterize the genome, at the present time many studies only involve
exome sequencing data or dense SNP genotyping, mainly due to cost. We have not
investigated the performance of PULSAR on these other genotyping platforms. SNP
genotyping panels are biased towards common SNPs. For that reason, we anticipate that
PULSAR would be of limited utility for such data in extended pedigrees, where only very
few common SNPs would be introduced only once into a given pedigree and thus serve as
LSAs to the algorithm. On small pedigrees with relatively few founders we expect that the
PULSAR algorithm would be much less impacted and work quite well.
There are a number of potential improvements and extensions to PULSAR that we
could present, but these are beyond the scope of this manuscript. However, we did explore
the utility of pre-screening variants upfront based on allele frequency when identifying
putative LSAs. When pedigree founders are not available for sequencing, which will often be
the case in reality, such filtering based on minor allele frequency was shown to be a
useful procedure for reducing the number of false positives among putative LSAs and thus
improving the performance of the method. However, allele frequency estimates should be
reliable to be effective as pre-filters. They can be taken either from the study at hand, if
there are sufficiently many founders, or from reference panels. If one takes the
estimates from reference panels, one should take care that the ethnicities be identical (or as
close to it as possible), and that the sequencing quality be high and, ideally, based on the
same technology platform and methods.
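A frequency-based pre-filter of the kind discussed above might look as follows. This is a sketch under stated assumptions: the threshold of 0.05 is a placeholder rather than the value used in our analyses, the function and parameter names are hypothetical, and keeping variants that are absent from the reference panel is a choice made only for this example.

```python
def minor_allele_frequency(alt_count, total_alleles):
    """Minor allele frequency from an alternate-allele count in a panel."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    freq = alt_count / total_alleles
    return min(freq, 1.0 - freq)


def prefilter_candidates(variant_ids, panel_counts, max_maf=0.05):
    """Keep variants rare enough to be plausible lineage-specific alleles.

    `panel_counts` maps a variant ID to (alt_count, total_alleles) observed
    in the frequency source (study founders or a reference panel). Variants
    missing from the panel are kept, on the assumption that an unobserved
    variant is rare. `max_maf` is an illustrative threshold.
    """
    kept = []
    for vid in variant_ids:
        if vid not in panel_counts:
            kept.append(vid)
            continue
        alt_count, total_alleles = panel_counts[vid]
        if minor_allele_frequency(alt_count, total_alleles) <= max_maf:
            kept.append(vid)
    return kept
```

The stricter the threshold, the fewer common variants are mistaken for LSAs, at the cost of discarding some true LSAs whose panel frequency estimate is noisy.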
The algorithm that we have presented is a rules-based procedure, rather than a
maximum likelihood approach. Conceptually, there are many methods that can be used to
infer phase in pedigrees using maximum likelihood. For example, one could use the
Elston-Stewart algorithm (Elston and Stewart 1971), as implemented in LINKAGE (Lathrop et al.
1984) or FASTLINK (Cottingham et al. 1993), to infer phase. The practical difficulty is that
the Elston-Stewart algorithm is very limited in the number of variants that can be jointly
analyzed, which would make it necessary to break the genome up into a vast number of
overlapping segments, whose phased haplotype data would then have to be patched
together. Alternatively, the Lander-Green algorithm (Lander and Green 1987), as
implemented in the software package MERLIN (Abecasis et al. 2002), can analyze many
variants simultaneously, but is limited in the number of individuals that can be handled.
For large pedigrees, this would then necessitate breaking them into overlapping
subcomponents, followed by re-combining the phased genotypes. These are not trivial
tasks, since joining marker segments of pedigree fragments may lead to Mendelian
inconsistencies and other problems. To overcome some of the limitations of the
Elston-Stewart algorithm and the Lander-Green algorithm, a number of Monte Carlo Markov Chain
methods have been developed, including LOKI (Heath 1997) and SimWalk2 (Sobel and Lange
1996). However, these methods are extremely computationally intensive and are not
readily applicable to whole genome sequence data. All of these methods are also impacted
by genotyping errors and missing genotype data to varying degrees, and are no panacea.
There are also a number of methods for phasing of singleton individuals (Browning
and Browning 2011), which are often based on hidden Markov models, such as
BEAGLE (Browning and Browning 2007). Currently, the most prominent phasing methods for
singleton individuals are variants of the approximate coalescent models, such as
SHAPEIT (Delaneau et al. 2011) and MACH (Li et al. 2010). Their accuracy and computational
speed have greatly improved recently, aided by the availability of ever-larger reference
panels for some of the major ethnic groups. The problem with applying these methods to
pedigrees is that suitable reference panels are not presently available in many smaller
populations, which are particularly suitable for pedigree studies. Without large reference
panels, statistical phasing methods perform much worse. In addition, treating family
members as unrelated individuals during phasing will often result in phased data that is
inconsistent with Mendelian rules of inheritance. As part of our research for this paper, we
have investigated the utility of duoHMM (O'Connell et al. 2014), which is implemented
within the program SHAPEIT2 (Delaneau et al. 2011) and seeks to correct haplotypes
produced by statistical phasing using the restrictions on inheritance observed in duos.
However, when applied to the simulations described in this study this method frequently
failed to run on most of the simulated pedigree structures, and thus we chose to exclude it
from the benchmarking results that were presented.
Given the different weaknesses and limitations of the various alternative phasing
methods when applied to pedigree data, there appears to be a need for a rational method of
combining phase results from multiple methods, especially those with different premises.
Results from a combination of methods may prove to be more accurate and complete, yet
also practical. For example, the phasing output from PULSAR could be used as input for
Monte Carlo Markov Chain methods, providing an initial plausible phased genotyping state
that could then be refined by MCMC. Or the output from statistical phasing for singletons
may be taken as input for pedigree-based phasing methods. In this way, one could combine
the high accuracy of methods that directly observe inheritance within pedigrees with
statistical phasing methods that are capable of inferring phase in segments of the genome
where inheritance is not directly observable given the individuals that are sequenced.
Here we have presented a novel algorithm, PULSAR, with associated software, for
phasing WGS genotypes in a broad range of pedigree sizes and structures. The high
accuracy of the haplotypes produced by the PULSAR algorithm, which often accurately span
entire chromosomes, is promising. The approach may also be a useful tool for genotype
error checking and correction. Based on evidence from our benchmarking we conclude that
PULSAR is a suitable and practical phasing approach across a broad range of pedigree
structures when a reasonable proportion of the pedigree members are sequenced.
Methods
Pedigrees used for simulation studies
We chose to investigate the phasing and imputation accuracy of PULSAR via
simulation in a variety of pedigree sizes and structures. Nuclear pedigree structures were
composed of two parents and a range of 1 to 7 children. As a variance reduction technique
in our simulations, each smaller pedigree was generated as a subset of the larger pedigrees.
In other words, the pedigree with 1 child is a subset of the pedigree with 2 children, and so
on. Seven larger, multigenerational pedigree structures were chosen from the San Antonio
Mexican American Family Studies (Mitchell et al. 1996; Hunt et al. 2005) (SAMAFS), having
11, 14, 20, 28, 55, 78 and 94 individuals comprising 3, 4, 4, 4, 6, 6 and 5 generations, respectively.
In these pedigrees, 4, 3, 3, 16, 29, 31, and 61 individuals are sequenced, respectively. Diagrams
of these pedigrees, generated using Cranefoot (Makinen et al. 2005), are included in the
supplementary materials. Altogether, 14 different pedigree structures were chosen to
represent a broad range of sizes and structures in order to investigate the accuracy and
completeness of the phasing results. Two additional pedigree structures were simulated for
the purpose of investigating computational scalability of the software: pedigrees with two
parents and a range between 1 and 1000 children, and pedigrees in which each generation
consisted of one offspring of the previous generation and one married-in founder for 2-50
generations.
Simulation of whole genome sequencing data
Whole genome sequencing data for 84 male X chromosomes (excluding the
pseudoautosomal regions) from British (GBR) and Finnish (FIN) populations from the 1000
Genomes Project (1000 Genomes Project Consortium et al. 2015) were used as the set of
potential founder chromosomes in the simulation studies. Utilizing real male X chromosomes has the
advantage that chromosomal haplotypes are known while also maintaining the complexities
of real whole genome sequencing data, such as the minor allele frequency distribution of
variants, linkage disequilibrium, and the existence of small IBD segments shared between
distantly related individuals (Browning and Browning 2007), though the X chromosome
may differ slightly in these characteristics from the autosomes. After filtering out
pseudoautosomal regions and loci with more than 2 alleles, there were a total of 318,912
polymorphic variants in the seed dataset. These seed chromosomes were shortened in
simulation to 200,000 variants in order to accommodate limitations in the software package
AlphaPhase1.1 (Hickey et al. 2011), by considering a smaller section of the chromosome.
AlphaPhase1.1 was used for benchmarking because it implements the “long range phasing”
approach in pedigrees (see below), although it was not designed for phasing WGS data.
Within each pedigree, founder chromosomes were sampled from these 84 chromosomes
randomly without replacement, which assumes that the pedigree founders represent a
random population sample. Inheritance of haplotypes within the pedigrees was simulated
by “gene dropping”, utilizing a probability of recombination between variants that
corresponds to 1 recombination every 100 Mb, a rough estimate of the actual
recombination rate in humans. Chromosomes not used as pedigree founder chromosomes
were used as a reference panel from which to calculate minor allele frequency. In summary,
for individuals in these pedigrees we simulated genotypes and chromosomal haplotypes
with many of the characteristics of real data.
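The gene-dropping step can be sketched as follows. This is a simplified illustration, not the simulation code itself: haplotypes are plain lists of alleles, the recombination probability between adjacent variants is taken as their distance in base pairs times 1e-8 (1 recombination per 100 Mb) and capped at 0.5, and the function names `drop_gamete` and `make_child` are hypothetical.

```python
import random


def drop_gamete(hap_a, hap_b, positions, rng, rate_per_bp=1e-8):
    """Transmit one recombinant haplotype from a parent carrying hap_a/hap_b.

    `positions` gives the base-pair coordinate of each variant in ascending
    order. Between adjacent variants the transmitted strand switches with
    probability distance * rate_per_bp (capped at 0.5), which corresponds to
    1 recombination per 100 Mb at the default rate.
    """
    parental = (hap_a, hap_b)
    current = rng.choice((0, 1))  # strand that starts the gamete
    gamete = []
    prev_pos = None
    for i, pos in enumerate(positions):
        if prev_pos is not None:
            p_switch = min(0.5, (pos - prev_pos) * rate_per_bp)
            if rng.random() < p_switch:
                current = 1 - current  # crossover between the two variants
        gamete.append(parental[current][i])
        prev_pos = pos
    return gamete


def make_child(father, mother, positions, rng, rate_per_bp=1e-8):
    """Return the child's (paternal, maternal) haplotypes.

    `father` and `mother` are (hap_a, hap_b) tuples.
    """
    paternal = drop_gamete(father[0], father[1], positions, rng, rate_per_bp)
    maternal = drop_gamete(mother[0], mother[1], positions, rng, rate_per_bp)
    return paternal, maternal
```

Founders would be assigned haplotypes drawn without replacement from the seed chromosomes, and `make_child` would then be applied down the pedigree, parents before offspring.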
Whole genome sequencing data
Whole genome sequencing data have been generated for many participants in the
SAMAFS. SNVs and di-allelic INDELs with at least 5 observations of the minor allele were
homogenized and merged from vendor-provided (Complete Genomics, Illumina) sequencing
genotype calls from 2330 directly sequenced genomes, resulting in 27,160,796 genetic variants.
These data are available through dbGaP Study Accession: phs000462.v1.p1. Chromosome 21
(356,545 variants) was selected for analysis. Individuals and variants with genotyping rates below
99% were excluded, resulting in 354,466 variants. We chose to focus on the same 7 pedigrees
of the SAMAFS for both real and simulated datasets in order to maximize the comparability
of the real-world data to the simulated data. The 7 pedigrees comprise 300
individuals, of which 147 are sequenced.
Benchmarking
We set out to benchmark PULSAR against long range phasing (LRP) (Kong et al.
2008), arguably the current best approach for phasing genotypes in multigenerational
pedigrees (Browning and Browning 2011). AlphaPhase1.1 (Hickey et al. 2011) is an
implementation of the LRP algorithm pioneered by Kong et al. (2008), which
phases haplotypes using the assumption that the genotype data of individuals sharing a
haplotype IBD must share at least one allele identical-by-state (IBS) at each locus in the
shared region. Functionally, the method searches for opposing homozygous genotypes, whose
presence excludes two individuals from sharing a segment IBD at that genomic location
(assuming no mutation or genotyping error). LRP is highly accurate when at least one
individual sharing the IBD segment is homozygous for a given variant.
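The opposing-homozygote test that underlies LRP can be written compactly. The sketch below is an illustration of the principle only and is not taken from AlphaPhase1.1: genotypes are assumed to be alternate-allele counts with `None` for missing calls, and the function names and the `max_opposing` tolerance are hypothetical.

```python
def opposing_homozygous_sites(geno_a, geno_b):
    """Indices at which one individual is homozygous reference (0) and the
    other is homozygous alternate (2). Sites with a missing call are skipped.
    """
    return [
        i
        for i, (a, b) in enumerate(zip(geno_a, geno_b))
        if a is not None and b is not None and {a, b} == {0, 2}
    ]


def may_share_ibd(geno_a, geno_b, max_opposing=0):
    """True if the two individuals are not excluded from sharing a haplotype
    IBD across the region.

    With max_opposing=0, a single opposing homozygous site excludes sharing,
    which assumes no mutation or genotyping error. A small positive tolerance
    is one possible way to allow for errors.
    """
    return len(opposing_homozygous_sites(geno_a, geno_b)) <= max_opposing
```

For example, genotypes `[0, 1, 2]` and `[1, 1, 1]` are compatible with IBD sharing, whereas `[0, 1, 2]` and `[2, 1, 2]` are not because of the first site.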
We utilized three metrics for the performance comparison in the simulations:
switch error rate (Lin et al. 2002) (SER) to assess phasing accuracy, the proportion of
heterozygous genotypes that are phased, and the ‘time’ function in the Linux environment
to assess the run time of the software. SER measures the rate at which adjacent
heterozygous genotypes are phased incorrectly with respect to each other.
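A minimal sketch of the SER calculation for one individual is given below. It assumes that the true and inferred haplotypes are available as parallel lists of alleles, that unphased sites carry `None` in the inferred haplotypes, and that sites whose inferred genotype disagrees with the truth are skipped; these conventions and the function name are assumptions of the example rather than a description of our evaluation scripts.

```python
def switch_error_rate(true_h1, true_h2, est_h1, est_h2):
    """Fraction of adjacent phased heterozygous sites whose relative phase
    is wrong. Returns None if fewer than two sites can be compared.
    """
    orientations = []
    for t1, t2, e1, e2 in zip(true_h1, true_h2, est_h1, est_h2):
        if t1 == t2:
            continue  # homozygous sites carry no phase information
        if e1 is None or e2 is None:
            continue  # site left unphased
        if {e1, e2} != {t1, t2}:
            continue  # inferred genotype disagrees with truth (assumption: skip)
        orientations.append(e1 == t1)
    if len(orientations) < 2:
        return None
    switches = sum(x != y for x, y in zip(orientations, orientations[1:]))
    return switches / (len(orientations) - 1)
```

Because only the relative orientation of neighbouring sites is compared, swapping the labels of the two inferred haplotypes does not change the result.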
Data Access
PULSAR is available at https://github.com/AugustBlackburn/PULSAR_1.0.
Acknowledgements
We would like to thank the participants of the SAMAFS. The SAMAFS whole genome
sequence data were obtained as part of the T2D-GENES Consortium, which is supported by
NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01
DK085545. The San Antonio Family Heart Study and San Antonio Family
Diabetes/Gallbladder Study provided pedigree collection data, which are supported by NIH
grants R01 HL0113323, P01 HL045222, R01 DK047482, and R01 DK053889. This work
was partially funded by NIH grant DK099051.
Disclosure Declaration
The authors report no conflicting interests.
Works cited
Abecasis GR, Cherny SS, Cookson WO, Cardon LR. 2002. Merlin--rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30: 97-101.
Browning SR, Browning BL. 2007. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet 81: 1084-1097.
Browning SR, Browning BL. 2011. Haplotype phasing: existing methods and new developments. Nat Rev Genet 12: 703-714.
Cottingham RW, Jr., Idury RM, Schaffer AA. 1993. Faster sequential genetic linkage computations. Am J Hum Genet 53: 252-263.
Delaneau O, Marchini J, Zagury JF. 2011. A linear complexity phasing method for thousands of genomes. Nat Methods 9: 179-181.
Elston RC, Stewart J. 1971. A general model for the genetic analysis of pedigree data. Hum Hered 21: 523-542.
1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA et al. 2015. A global reference for human genetic variation. Nature 526: 68-74.
Heath SC. 1997. Markov chain Monte Carlo segregation and linkage analysis for oligogenic models. Am J Hum Genet 61: 748-760.
Hickey JM, Kinghorn BP, Tier B, Wilson JF, Dunstan N, van der Werf JH. 2011. A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes. Genet Sel Evol 43: 12.
Hunt KJ, Lehman DM, Arya R, Fowler S, Leach RJ, Goring HH, Almasy L, Blangero J, Dyer TD, Duggirala R et al. 2005. Genome-wide linkage analyses of type 2 diabetes in Mexican Americans: the San Antonio Family Diabetes/Gallbladder Study. Diabetes 54: 2655-2662.
Kong A, Masson G, Frigge ML, Gylfason A, Zusmanovich P, Thorleifsson G, Olason PI, Ingason A, Steinberg S, Rafnar T et al. 2008. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat Genet 40: 1068-1075.
Lander ES, Green P. 1987. Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci USA 84: 2363-2367.
Lathrop GM, Lalouel JM, Julier C, Ott J. 1984. Strategies for multilocus linkage analysis in humans. Proc Natl Acad Sci USA 81: 3443-3446.
Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. 2010. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol 34: 816-834.
Lin S, Cutler DJ, Zwick ME, Chakravarti A. 2002. Haplotype inference in random population samples. Am J Hum Genet 71: 1129-1137.
Makinen VP, Parkkonen M, Wessman M, Groop PH, Kanninen T, Kaski K. 2005. High-throughput pedigree drawing. Eur J Hum Genet 13: 987-989.
Mitchell BD, Kammerer CM, Blangero J, Mahaney MC, Rainwater DL, Dyke B, Hixson JE, Henkel RD, Sharp RM, Comuzzie AG et al. 1996. Genetic and environmental contributions to cardiovascular risk factors in Mexican Americans. The San Antonio Family Heart Study. Circulation 94: 2159-2170.
O'Connell J, Gurdasani D, Delaneau O, Pirastu N, Ulivi S, Cocca M, Traglia M, Huang J, Huffman JE, Rudan I et al. 2014. A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genet 10: e1004234.
Sobel E, Lange K. 1996. Descent graphs in pedigree analysis: applications to haplotyping, location scores, and marker-sharing statistics. Am J Hum Genet 58: 1323-1337.
Sun L, Dimitromanolakis A. 2014. PREST-plus identifies pedigree errors and cryptic relatedness in the GAW18 sample using genome-wide SNP data. BMC Proc 8: S23.
Sun L, Wilder K, McPeek MS. 2002. Enhanced pedigree error detection. Hum Hered 54: 99-110.
Tewhey R, Bansal V, Torkamani A, Topol EJ, Schork NJ. 2011. The importance of phase information for human genomics. Nat Rev Genet 12: 215-223.