Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. AccuratePhasingofPedigreeGenotypesUsingWholeGenomeSequenceData BlackburnA.N.1*,KosM.Z.1,BlackburnN.B.1,2,PeraltaJ.M.1,2,StevensP.1, LehmanD.M.2,BlondellL.1,BlangeroJ.1,GöringH.H.H.1 1SouthTexasDiabetesandObesityInstitute,SchoolofMedicine,UniversityofTexasRio GrandeValley 2MenziesInstituteforMedicalResearch,UniversityofTazmania,Hobart,TAS,Australia 3DepartmentofMedicine,UniversityofTexasHealthScienceCenteratSanAntonio *Correspondingauthor: Address: AugustBlackburnPh.D. 3463MagicDriveSuite320 SanAntonio,TX78229 Phone:210-585-9773 E-mail:[email protected] RunningTitle:PhasingWholeGenomeSequencingDatainPedigrees Keywords:Phasing,Pedigrees,Sequencing,LineageSpecificAlleles,PULSAR 1 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. Abstract Phasing,theprocessofpredictinghaplotypesfromgenotypedata,isanimportant undertakingingeneticsandanongoingareaofresearch.Phasingmethods,andassociated software,designedspecificallyforpedigreesareurgentlyneeded.Herewepresentanew methodforphasinggenotypesfromwholegenomesequencingdatainpedigrees:PULSAR (PhasingUsingLineageSpecificAlleles/Rarevariants).Themethodisbuiltupontheidea thatallelesthatarespecifictoasinglefoundingchromosomewithinapedigree,whichwe refertoaslineage-specificalleles,arehighlyinformativeforidentifyinghaplotypesthatare identical-by-decentbetweenindividualswithinapedigree.Throughextensivesimulation weassesstheperformanceofPULSARinavarietyofpedigreesizesandstructures,andwe exploretheeffectsofgenotypingerrorsandpresenceofnon-sequencedindividualsonits performance.IfthegenotypingerrorrateissufficientlylowPULSARcanphase>99.9%of heterozygousgenotypeswithaswitcherrorratebelow1x10-4inpedigreeswhereall individualsaresequenced.Wedemonstratethatthemethodishighlyaccurateand consistentlyoutperformsthelong-rangephasingapproachusedforcomparisoninour benchmarking.Themethodalsoholdspromiseforfixinggenotypeerrorsorimputing missinggenotypes.Thesoftwareimplementationofthismethodisfreelyavailable. Introduction Haplotypes,whicharecombinationsofallelesatdifferentpolymorphicsites occurringonthesameDNAmolecule,areimportantinthestudyofgenetics.(Tewheyetal. 2011)Haplotypesareusefulforimputationofallelesatungenotypedloci,identificationof genomicregionssharedidentical-by-descent,genotypeerroridentificationandcorrection, identificationofcompoundheterozygosity,andanalysisofparent-of-origineffects,among manyothertopics.Haplotypescanalsobeusedinplaceofallelesatindividualvariablesites 2 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. inassociationtesting.Presentlythemostpopularsequencingplatformsandassociated softwarepackagesreportgenotypesforindividualpolymorphisms,andthushaplotypes mustbeinferredalgorithmically.Phasing,theprocessofanalyzinggenotypestopredict haplotypes,isthereforeanimportantundertakingingeneticsandanongoingareaof research.(BrowningandBrowning2011) Whilefamilystudieshavenotbeenpopularforassociation-basedgenemappingon complextraitsrecently,thissituationischangingwiththeemerginginterestinrare variants,whicharearguablymoreeasilyandmorepowerfullystudiedinfamilydata.Inany case,theincreaseinthenumbersofstudysubjectsinapopulationbeingsequencedwill necessarilyleadtoinclusionofmore–andmoreclosely–relatedindividuals,bychance.For phasing,pedigreesprovideadditionalinformationcomparedtounrelatedindividuals (whichareinrealitydistantlyrelated).Thedirectobservationofinheritanceofallelesfrom onegenerationtothenext,whichispossibleinfamilies,canbeusedtoestablishphase empirically,andhencefamiliesconceptuallyallowforhighlyaccurateestimationof haplotypes.Thisalsomeansthatphasingoffamilydatadoesnotnecessarilyhavetobear thecomputationalexpenseofdealingwithavastnumberofadditionalsamplesintheform ofreferencepanels,andthatfamilydatacanbeusefulinpopulationsforwhichgood referencepanelsarenotavailable.However,pedigreedatabringswithitsubstantialand uniquecomputationalchallenges.Thus,computationalmethodsandready-to-usesoftware tailor-madeforphasingrelatedindividualsinpedigreesusingwholegenomesequencedata areneeded. Herewedescribeanovelfastalgorithm,whichwehavenamedPULSAR(Phasing UsingLineageSpecificAlleles/Rarevariants),thatphaseswholegenomesequencingdata infamilies.PULSARisbuiltupontheideathatallelesthatarespecifictoasinglefounding chromosomewithinapedigree,whichwerefertoaslineage-specificalleles(LSAs),are 3 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. highlyinformativeforidentifyinghaplotypesthatareidentical-by-decent(IBD)between individualswithinthepedigree.Inhumans,eachchromosomecarriesmanysuchrare variants,whichwecannowinterrogatewithwholegenomesequencing,allowingfor reasonablydensehaplotypemapscomprisedofLSAs,fromwhichwecanobserve inheritanceofhaplotypesempirically.Thisinitialhaplotypemapcanthenbeexpandedto includenon-LSAs,whichtendtobemorecommonvariants.Wedemonstratetheutilityof PULSARanditssoftwareimplementationonsimulateddataaswellasrealwholegenome sequencingdataandcompareitsperformancetolong-rangephasing. Results Descriptionofmethod ThegeneralstepsofthePULSARalgorithmareasfollows:1)identifyallelesthatare likelytobelineage-specific(i.e.,LSAs);2)identifyhaplotypes,theirboundaries,andtheir inheritanceusingLSAs;3)extendtheestimatedboundariesofthesehaplotypesbasedon theobservationthatindividualswillshareatleastonealleleatlociwheretheysharea haplotypeIBD(i.e.,theideabehindlongrangephasing);and4)assignallelestothe haplotypes.Ourmethodassumesthatthephysicallocationofvariantsisknown,and optionallymakesuseofexternalinformationregardingallelefrequencies.Below,wewill describethesestepsinmoredetail. Identifyingputativelineage-specificalleles ThepremiseofthePULSARalgorithmisthatwhenanalleleispresentinasingle chromosomeamongthefoundersofagivenpedigree,thenanydirectdescendentofthat founderalsocarryingtheallelecanbeassumedtosharethechromosomalsegment 4 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. harboringthatlocusIBD,whichallowsforempiricalestimationoftheinheritanceof haplotypeswithinthepedigree.Exceptionstothislogic,suchasdenovomutation,arereal butinfrequentenoughnottounderminethepremiseoftheapproach.Sincenotallfounders willbesequencedinmanypedigrees,aninitialchallengeisidentifyingthoseallelesthatare specifictoasinglefoundingchromosome.WeidentifypotentialLSAsbyanalyzingthe patternofindividualswithinapedigreecarryingagivenallele.Theimplementationofthis approachissimple;wesearchforallelesforwhichallindividualscarryingthealleleshareat leastonefounderwithintheknownstructureofthepedigreeunderconsideration.We checkthatthedirectlineagebetweeneachfounderandeachpersoncarryingtheallelealso carriestheallele(atleastforthoseindividualsthataresequenced)andthatthealleleisnot homozygousinanyindividuals.Fornow,thealgorithmdoesnotaccommodatethepresence ofinbreedingloops,whereinLSAscouldbehomozygousininbredindividuals.Figure1A showsanexampleofafamilywhereindividualscarryanallelethatcannotbelineagespecificsincethecarriersdonotshareacommonfounder.Figure1Bshowsanexampleofa familywhereindividualscarryaputativelineagespecificallele.Notethatatthispointwe areonlyseekingtoidentifyputativeLSAs. 5 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. Figure 1! A. B. Figure1.Inheritancepatternsoflineagespecificalleles.PanelAshowsanexampleofa patternofindividualscarryinganalleleinwhichtheallelecannotbelineage-specificsince individualscarryingthealleledonotshareacommonfounder.PanelBshowsanexample wheretheallelecouldbelineagespecific.Theindividualscarryingtheallelehavetwo commonfounders. ConstructinghaplotypesboundariesandestablishhaplotypeinheritanceusingLSAs UsingthesetofputativeLSAsfromthepreviousstep,wethenseektoestablish boundarieswithinwhichhaplotypesaresharedIBDbetweenasetofrelatedindividualsin apedigree.NotethatatthispointtheputativeLSAsmayincludesomefalsepositives. However,fornow,letusassumethatwearedealingonlywithtrueLSAs,andrelaxthis assumptionlater.ThesetofindividualscarryingatrueLSAwillshareanalleleIBDat nearbyloci,assumingabsenceofmeioticrecombinationbetweentheloci,mutation,or genotypingerror.Conversely,achangeinpattern(ofwhichindividualsshareaLSA)from 6 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. onelocustothenextindicatesarecombinationevent(s)withinthemeiosesthatgiveriseto thehaplotypiclineage.Hence,bytrackingwhichindividualsshareneighboringLSAsalong thechromosomewecanestablishestheboundariesofagivenhaplotype.Whiletheconcept isstraightforward,complicationsarisebecauseatanygivenregionofthediploidgenomeall individualscarrytwohaplotypes,onematernalandonepaternal.Therefore,itisnecessary totracktwoseparatehaplotypessimultaneously.Weimplementthisapproachusinga rules-basedalgorithmtoidentifythechangesinthepatternsofLSAsharingalonga chromosome. Ifnotallindividualsaresequencedinagivenpedigree,whichisoftenthecase,there willlikelybesomefalsepositivesamongtheputativeLSAsidentifiedinthepreviousstep.In otherwords,someoftheputativeLSAsareinrealitynotIBDbutmerelyidentical-by-state (IBS).ToreducetheriskofmistakingIBSforIBD,andthusinferringwronghaplotypes,we requireapredeterminednumberofneighboringputativeLSAstobesharedinorderto demarcateanewhaplotype.AverylownumberofchangedLSApatternobservationsare requiredtoinfer,withhighconfidence,thatthereisatruerecombinationeventbecauseitis highlyunlikelythatthesameindividualswillshareputativeLSAsoveragivenregionofa chromosomeifthegenotypesareaproductofchance(suchasbeingIBSorperhapsdueto genotypingerror)andnottrulylineagespecific.Herewesetthisheuristicto5observations ofneighboringputativeLSAswiththesamechangedpatternofindividualssharingthe putativeLSAs. Afteridentifyingthehaplotypesaprobandcarriesalongthechromosome,we performtwoprocedurestoestablishchromosomallengthhaplotypesfromthesesmaller haplotypesegments.First,duringthecourseofidentifyingthehaplotypesifachangeis observedinonesetofindividuals(onehaplotypesegmentchanges)weassumethatthe previousandnewhaplotypesegmentsareonthesamechromosomeinthepersonwhere 7 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. thepatternhaschanged.Second,fornon-founderswithasufficientnumberofinformative relativeswithsequencingdataweidentifyifthehaplotypeissharedwiththematernalor paternalsideoftheproband’srelatives.Sinceeachprobandinheritsonechromosomefrom eachparent,haplotypesthataresharedwitheitherpaternalormaternalrelativesare assumedtobeoneitherthepaternallyormaternallyinheritedchromosome,respectively. ExtensionofhaplotypeboundariesusingIBSallelesharing Whilethehumangenomecontainsavastnumberofrarevariants,manyofwhich willbelineage-specificinagivenpedigree,thedensityofLSAsnonethelesslimitsthe precisionwithwhichtheboundariesofhaplotypescanbedetermined.Weextendthe haplotypeboundariesobservedinaprobandwithasimpleheuristic,namelythat individualssharingahaplotypemustshareanalleleIBD(andthusalsoIBS)ateachlocusin thehaplotype.Thissameheuristiciscentraltotherationalebehindlong-rangephasing,and ourimplementationissimilarinthatwesearchforopposinghomozygousgenotypes.The primarydifferenceisthatwearenotusingthisrationaletodiscoversharedhaplotypes, ratheronlyextendingpredeterminedhaplotypes,carriedbyknownindividuals,for relativelysmallunresolvedsegmentsofthegenome(typically<1%ofthegenome;we presentaninvestigationofthedensityofLSAcoveragebelow).Thishaplotypeextension stepproceedsinbothdirectionsintotheunassignedgapbetweentwoneighboring haplotypesidentifiedinthepreviousstep,andweextendhaplotypesonlysofarthatthere isnooverlap. Mappingallelestohaplotypes Afterestablishingthehaplotypeboundariesandinheritancethroughoutthe pedigree,wethenmapallelesforallvariantsontothesehaplotypes.Thisstepcouldbe 8 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. integratedintothepriorsteps,butinourimplementationweassignallelestohaplotypes (includingLSAs)asaseparate,finalstepoftheprocedure.Homozygousgenotypesare straightforwardtomapontohaplotypes,witheachofthetwohaplotypescarryingthesame allele;thisisdonefirst.Consideringeachvariantindependently,wethenphase heterozygousgenotypesinindividualsforwhichatleastoneallelehasalreadybeen mappedontooneofthetwohaplotypestheycarryatthatgenomiclocation.Werepeatthis mappingprocessiteratively,nowincorporatingthemappedallelesfromheterozygous genotypesfromthepreviousiteration,untilnoadditionalgenotypescanbephased.When allindividualswithinapedigreearesequenced,andallofthehaplotypeseachindividual carriesareknown,barringgenotypingerrorsthisprocedurecanresolvethephaseofall combinationsofgenotypeswithexceptionofthecasewhereeveryindividualis heterozygous,ascenariothatbecomeslesslikelyinlargerpedigrees. PULSARconsidersevidencefrommultipleindividuals(ifmultiplepeoplecarrythe haplotype)whenassigningallelestohaplotypes.Inthepresenceofgenotypingerrorsit becomespossibleforthegenotypesofmultipleindividualstoprovideconflictingsupport forwhichalleleiscarriedonagivenhaplotype.Asanexample,twoindividualssharea haplotypebuthaveopposinghomozygousgenotypes.Intheseambiguouscases,whenmore thanonealleleissupportedbytheinferredhaplotypesharingandindividualgenotypes,the allelesupportedbythemajorityofindividualsisassignedtothathaplotype.Whenthe correctallelesareassignedtobothhaplotypescarriedbyanindividual,reconstructed genotypesfromthesehaplotypescanbeusedtocorrectgenotypingerrors.However,in somecasesthereisnomajorityofindividualswithwhichPULSARcancorrectgenotyping errors.Asanexample,intriosnohaplotypeissharedbetweenmorethantwopeople,and thusnomajoritycanbeformedinordertocorrectgenotypingerrors.Thissameprincipleis trueforhaplotypesthataresharedbetweenonlytwoorfewerindividualswithinlarger 9 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. pedigrees,suchashaplotypessharedbetweenmarried-infoundersandonlyoneoffspring whenthelineageisnotpassedontosubsequentgenerations. Benchmarking Coverageandallele-frequencydistributionoflineage-specificalleles Aswedescribed,thePULSARalgorithmisbasedontheideathatvariantsthatare introducedintoapedigreeviaasinglefoundingchromosome,i.e.LSAs,canbeusedas convenienttagstotracetheinheritanceofhaplotypeswithinpedigrees.Forthisapproach tobepractical,itiscrucialthatLSAsexistinsufficientdensityinrealisticpedigree structures.Thus,wehaveestimatedthedegreeofLSAcoverageusingrealsequencingdata withknownphase,namelytheXchromosomesfromtheBritish(GBR)andFinnish(FIN) cohortsfromthe1000GenomesProject(GenomesProjectetal.2015).Ifweviewpedigree foundersasarandompopulationsample,thenvariouskeyaspectsofLSAcoveragecan easilybedeterminedbypermutationfordifferentpedigreestructures.SupplementalFigure 1showsthedistributionofthenumberofLSAsperMbalongtheXchromosomeasa functionofthenumberofpedigreefoundersfortheGBRandFINcohorts.InBritish,the densityrangesfromamedianof135.4LSAsperMbin2-founderpedigrees(medianinterLSAdistanceof874bp,withamaximumof5.14Mbwhichincludesagapincoveragecaused bythecentromere)to14.8LSAsperMbwith15founders(mediandistanceof19.0Kb,with amaximumof5.39Mbwhichincludesagapincoveragecausedbythecentromere).We estimatethatapedigreewith170founderswouldhave~1.0LSAsperMb.InFinns,a populationwithalowergeneticdiversity(primarilyduetoasmallfoundingpopulation), therearecomparativelyfewerLSAs,amedianof133.1LSAsperMbin2-founderpedigrees and13.4LSAsperMbwith15founders.ThetrendtofewerLSAsinlargerpedigreesmake 10 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. sensesinceourdefinitionofLSAisbasedonasinglefoundingoccurrenceperpedigree,and thusfewerpolymorphicsitesqualifythemorefoundersapedigreecontains. WithregardtotheallelefrequencydistributionofLSAs,sincetheprobabilityofa singlefoundingeventinapedigreedependsonthepopulationprevalenceofanallele,rare allelesaremorelikelytobespecifictoasinglefoundingchromosome,andtheenrichment forrareallelesamongallLSAswillbegreaterinpedigreeswithmorefounders.Thisiswhat weobserve.InBritish,with2founders~13%ofLSAshaveminorallelefrequencies<5% (medianallelefrequencyis21.7%),whereaswith15founders~91%ofLSAshaveminor allelefrequencies<5%(medianallelefrequencyis2.2%).Theseobservationsindicatethat MAFestimates,eitherfromthedatasetinhandorfromareferencepanel,areexpectedtobe ausefulfilterfordecreasingthefalsepositiveratewhenidentifyingLSAsisambiguous (perhapsduetonon-sequencedindividuals).Inanycase,theimportanttake-homemessage withregardtoourphasingmethodisthatinhumansthereappeartobesufficientlymany LSAstomakethemusefulasastartingpointforphasingofWGSdatainpedigrees. Pedigreeswithcompletesequencing WemeasuredtheperformanceofPULSARandAlphaPhase1.1inpedigreesvarying insizeandstructureundertheidealscenarioinwhichallindividualsinthedatasetare sequencedwithoutgenotypingerrors.Table1presentsacomparisonoftheSERandthe percentageofheterozygousmarkersphasedforthesesimulations.Acrossallsimulations, PULSARproducedlowerSERsandahigherpercentageofheterozygousmarkersphased thanAlphaPhase1.1.Inthecaseofnuclearpedigreessomemarkerswereheterozygousin allindividualsinapedigree,asituationthatisunresolvablewithoutexternaldata. Excludingtheseunresolvablecases,theobservednumberofphasedheterozygous genotypeswas>99.9%ofthetheoreticalupperlimitusingPULSAR. 11 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. Effectofgenotypingerrors WesoughttoassesstherobustnessofPULSARtogenotypingerrors.Table2 presentsacomparisonoftheswitcherrorrateandthepercentageofheterozygousmarkers phasedforsimulationswhereinthegenotypingaccuracyis99.0%(assuming,forsimplicity, anequalerrorrateforallvariants).Again,PULSARproducedlowerswitcherrorratesand higherpercentageofheterozygousmarkersphasedthanAlphaPhase1.1. Table1.CompleteandAccurateData SwitchErrorRate PedigreeSize 3 4 5 6 7 8 9 11 14 20 28 55 78 94 Founders 2 2 2 2 2 2 2 5 6 6 8 17 21 28 Generations 2 2 2 2 2 2 2 3 4 4 4 6 6 5 PULSAR 0 0 -5 1.04x10 -5 1.28x10 -5 1.79x10 -5 1.56x10 -5 1.68x10 -5 1.90x10 -5 4.85x10 -4 1.07x10 -5 3.43x10 -4 1.15x10 -4 7.15x10 -4 2.64x10 AlphaPhase1.1 -2 4.17x10 -2 8.46x10 -2 9.20x10 -2 8.44x10 -2 8.93x10 -2 9.28x10 -2 7.84x10 -2 5.52x10 -2 5.46x10 -2 4.10x10 -2 4.73x10 -2 5.50x10 -2 5.53x10 -2 6.26x10 PercentageofHeterozygous GenotypesPhased PULSAR AlphaPhase1.1 0.8164 0.7425 0.8499 0.8193 0.9420 0.8994 0.9887 0.9312 0.9998 0.9398 0.9999 0.9514 0.9999 0.9464 0.9996 0.9794 0.9979 0.9815 0.9966 0.9846 0.9989 0.9849 0.9989 0.9747 0.9973 0.9744 0.9966 0.9696 Table2.GenotypingErrors SwitchErrorRate PedigreeSize 9 11 14 20 28 55 78 94 Founders 2 5 6 6 8 17 21 28 Generations 2 3 4 4 4 6 6 5 PULSAR -3 1.74x10 -3 8.33x10 -3 8.44x10 -3 6.99x10 -3 5.95x10 -3 7.51x10 -3 9.28x10 -3 8.41x10 12 AlphaPhase1.1 -2 7.22x10 -2 6.95x10 -2 6.22x10 -2 5.33x10 -2 6.36x10 -2 6.71x10 -2 6.73x10 -2 7.60x10 PercentageofHeterozygous GenotypesPhased PULSAR AlphaPhase1.1 0.9997 0.8960 0.9961 0.8800 0.9837 0.9093 0.9830 0.9316 0.9912 0.9242 0.9629 0.9149 0.9437 0.9228 0.9026 0.9091 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. Forthecaseofnuclearpedigrees,wecreatedmultiplesimulatedgenotyping datasetswitharangeofgenotypingaccuracies(from99.0%to100.0%).Genotypingerrors weresimulatedfor20datasetsateachgenotypingaccuracylevelforeachpedigree.As showninFigure2,asthegenotypingerrorrateincreases,theswitcherrorrateincreases linearlyinthesimulatedpedigrees.Genotypingerrorshadanattenuatednegativeaffecton switcherrorrateinthepedigreeswithmorechildren.Thereasonbeingthatinalarger pedigree,moreindividualspotentiallysharegenomicsectionsIBDwithoneanother, enablingPULSARtoidentifyhaplotypesmorereliablyinthepresenceofgenotypingerrors. 0.025 Figure 2! 0.015 ● ● 0.010 Switch error rate 0.020 1 child 2 children 3 children 4 children 5 children 6 children 7 children ● 0.005 ● 0.000 ● ● 0.000 0.002 0.004 0.006 Simulated genotype error rate 0.008 0.010 Figure2.EffectofgenotypingaccuracyandIBDsharingonswitcherrorrate.Theplot showstheswitcherrorrates(withestimatederrorbar)from20simulationsatvarying genotypingaccuraciesinnuclearpedigreeswitharangeof1-7children,inwhicheach pedigreewithfewerchildrenisasubsetofthepedigreeswithmorechildren.Genotyping errorsincreasetheswitcherrorrateinapredictablemanner.IncreasedIBDsharingwithin thepedigreeattenuatesthiseffect. 13 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. BasedonthehaplotypesoutputbyPULSARwereconstructedgenotypesand comparedthemtothetruegenotypes(i.e.,withoutintroducedgenotypingerrors).As showninFigure3,theaccuracyofthereconstructedgenotypesdecreaseslinearlywiththe simulatederrorrateacrosspedigrees.Thereconstructedgenotypesaccuraciesfromthetrio pedigreeareverysimilartotheaccuracyofthesimulatedgenotypes,whichisexpected becausethepedigreestructurelacksamajorityofhaplotypeswithwhichtocorrect genotypingerrors(asoutlinedpreviously).Infact,thereconstructedgenotypeerrorrateis slightlyhigherthanthetruegenotypeerrorrate,becausethealgorithmintroducessome (thoughveryfew)errorswhenhaplotypeboundaryestimationisimperfect.However, PULSARisabletocorrectgenotypesinpedigreeswith2ormorechildren,performing betterinpedigreeswithmorechildren.Togethertheseobservationsalsodemonstratethat increasedIBDsharingbetweenpedigreemembersimprovestheperformanceofthe PULSARalgorithm. 14 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. 0.010 Figure 3! ● 0.006 ● 0.004 ● ● 0.000 0.002 Reconstructed genotype error rate 0.008 1 child 2 children 3 children 4 children 5 children 6 children 7 children ● ● 0.000 0.002 0.004 0.006 0.008 0.010 Simulated genotype error rate Figure3.EffectofIBDsharingoncorrectinggenotypesthroughimputation.The plotshowstheaccuracies(andassociatedestimatederror)ofgenotypes reconstructedfromhaplotypesproducedbyPULSARfor20simulationsatvarying genotypingaccuraciesinnuclearpedigreeswitharangeof1-7children,inwhich eachpedigreewithfewerchildrenisasubsetofthepedigreeswithmorechildren. Thediagonalispresenttoshowtheequivalencebetweentheaxes.IncreasedIBD sharingimprovestheabilityofthePULSARalgorithmtocorrectgenotypesthrough imputation. Effectofungenotypedindividuals Ungenotypedindividualsinapedigreeareanotherfrequentcomplicationinreal datasets.PULSARwillnotphaseheterozygousgenotypesforindividualsingenomicregions wheretheindividualdoesnotshareachromosomalsegmentIBDwithanothersequenced 15 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. individual,unlessdoneerroneouslyafterfalselyinferringIBDsharing.Non-sequenced individualscanaffectthefalsepositiverateamongputativeLSAsidentifiedbyPULSAR, whichwillultimatelydecreasetheaccuracyofhaplotypeboundaryandsharingestimation. Table3presentsacomparisonoftheswitcherrorrateandthepercentageof heterozygousmarkersphasedforsimulationswhereinsomeindividualsaremissing sequencingdata.FortheSAMAFSpedigrees,individualswithrealsequencingdatawere givensequencingdatainthesimulation.Weperformedtwosimulationsinvolvingthe nuclearfamily,whereinoneandbothparentswerenotsequenced.Thecomparative estimatesforthescenarioofcompletegenotypingofeverypedigreememberarealready showninTable1.Thenumberofungenotypedindividualsand,moreimportantly,the locationoftheungenotypedindividualswithinthepedigreehavenotableeffectsonthe haplotypephasingresults.Inthecaseofthenuclearfamilies,childrenmissingsequencing datahaveonlyasmallnegativeeffectontheaccuracyofPULSAR,asthissituationdoesnot leadtoanincreaseinthenumberoffalselyinferredLSAs,thoughthenumberof observationsofeachhaplotypeisreducedonaverage.Childreninanuclearfamilymissing sequencingdataareessentiallythesameasreducingthesizeofthenuclearfamilybythe numberofunsequencedchildren.Aconsequenceisthatahigherportionofvariantswillbe heterozygousinallsequencedindividuals(thisisonlyofpracticalimportanceifthenumber ofsequencedindividualsinapedigreeissmall).Forthatreason,themainimpactofmissing childreninnuclearfamiliesisaloweredpercentageoflocithatarephasedbecausePULSAR doesnotresolvecaseswhereinallindividualsareheterozygous(seetheestimatesfrom Table1fornuclearfamiliesofdifferentsizes).Theeffectofmissingparentsismuchgreater becausethisscenarioleadstoambiguityinidentifyingLSAs.Inthesimulatedpedigreewith onemissingparentthefalsepositiverateamongputativeLSAswas15.9%comparedto0% 16 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. whenbothparentsaresequenced,andtheSERincreasedfrom1.7x10-5to4.0x10-2.Thetotal numberofheterozygousmarkersphaseddroppedfrom99.99%to99.45%. Mitigatingthenegativeeffectofungenotypedindividuals Wesoughttomitigatethenegativeeffectofungenotypedindividualsontheperformanceof PULSAR.WeinvestigatedtheextenttowhichfilteringLSAsbasedonminorallelefrequency wouldhelp.Table3alsopresentsacomparisonoftheeffectsofmissingindividualsin nuclearfamilieswhereinwehavepre-filteredputativeLSAsbasedonaminorallele frequencyof<5%.Inthenuclearpedigreewith7childrenwithonenon-sequencedparent, therateoffalselyinferredLSAswas15.9%priortofilteringand1.50%afterward.TheSER droppedfrom4.0x10-2to1.5x10-3.Pre-filteringisnotwithoutcost,however.Thetotal numberofputativeLSAsdroppedfrom61,160to7,850.Thetotalpercentageof heterozygousmarkersphaseddroppedfrom99.45%to98.58%.Inthenuclearfamily, consideringthetradeoffsbetweentheaccuracyofthephasingresults(measuredusingSER) andthepercentageofheterozygousmarkersphased,filteringbasedonMAFperformed better,improvingtheoverallnumberofheterozygousmarkersphasedcorrectly,whichcan bemeasuredbytheproductoftheaccuracy(1-SER)andthepercentageofheterozygous markersphased. Inthelargestpedigreeinthisstudy,wherein61of94(64.9%)individualshave sequencingdata,filteringusingMAF<5%hadonlyaminoreffectontheresults.Thefalse positivesamongputativeLSAsdroppedfrom0.0099to0.0091,thetotalnumberofputative LSAsdroppedfrom44889to40060,theSERwentupfrom2.54x10-3to2.98x10-3,andthe percentageofheterozygousvariantsphaseddroppedfrom97.4%to97.3%.Inpedigrees withmanyfounders,theputativeLSAsthatPULSARidentifiesalreadyhavelowallele frequenciesforthemostpart,andhencefilteringata5%prevalencethresholdhaslittle 17 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. effect.ItispossiblethatanevenlowerMAFfilterwouldbeadvantageous,thoughitisclear thatMAFfilteringismoreadvantageousinsmallerpedigreeswithfewerfounders. Non-sequencedindividualshaveanegativeeffectontheperformanceofthe PULSARalgorithm,butnotinallcases.Sinceitisnotalwaysstraightforwardtoestimate whichcasesofmissingindividualsaredetrimentaltotheeffectivenessofthePULSAR algorithmwehavecreatedatoolthatusesaMonte-Carlogene-droppingapproachto estimatetheexpectedpercentageofthegenomethateachindividualwillshareatleastone haplotypeIBDwithanothersequencedindividualinthepedigree.Thistoolcanbeusedfor aprioriestimationoftheapplicabilityofthePULSARalgorithmtoapedigreedatasetgiven thesequencedindividualswithinthepedigree. Table3.MissingIndividuals Founders Generations (sequenced) PedigreeSize (sequenced) 9(7) 9(8) 11(4) 14(3) 20(3) 28(16) 55(29) 78(31) 94(61) 2(0) 2(1) 5(1) 6(0) 6(0) 8(2) 17(3) 21(5) 28(9) 2 2 3 4 4 4 6 6 5 SwitchErrorRate PULSAR PUSLAR w/prefiltering -2 -4 1.04x10 2.44x10 -2 -3 4.04x10 1.46x10 -2 -3 3.58x10 1.52x10 -5 -5 5.30x10 5.30x10 -2 -2 5.95x10 5.48x10 -2 -3 1.11x10 3.08x10 -2 -2 2.00x10 1.44x10 -3 -3 8.17x10 5.76x10 -3 -3 2.54x10 2.98x10 AlphaPhase1.1 -2 8.75x10 -2 1.41x10 -2 8.22x10 -2 5.34x10 -2 4.54x10 -2 6.35x10 -2 7.50x10 -2 7.21x10 -2 6.57x10 PercentageofHeterozygous GenotypesPhased PULSAR PUSLAR AlphaPhase1.1 w/prefiltering 0.8021 0.8048 0.7677 0.9945 0.9858 0.9443 0.6915 0.5473 0.6822 0.3266 0.3256 0.4128 0.6601 0.6587 0.0040 0.9944 0.9988 0.9730 0.9788 0.9609 0.8901 0.9945 0.9873 0.8989 0.9738 0.9726 0.9280 Performanceondatawithsequencingerrorsandmissingindividuals Wethensoughttobenchmarkourapproachinrealisticdata,havingbothmissing individualsandgenotypingerrors.Todothisweonlyexaminedgenotypesforindividuals withwholegenomesequencingdataavailableintheSAMAFS.Thenuclearpedigreewasset upsothatgenotypesforoneparentweremissing.Wesimulatedgenotypeswitha conservativegenotypeaccuracyof99.0%.Mostsequencingplatformsoutperformthis 18 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. benchmarkeasilywithappropriatereaddepth,andthus99.0%accuracyisconservative. ResultsforthesesimulationsarepresentedinTable4.Withfewexceptions,PULSAR outperformedAlphaPhase1.1bothinaccuracyandincompleteness. Table4.RealisticData SwitchErrorRate PedigreeSize (sequenced) 9(8) 11(4) 14(3) 20(3) 28(16) 55(29) 78(31) 94(59) Founders (sequenced) 2(1) 5(1) 6(0) 6(0) 8(2) 17(3) 21(5) 28(9) Generations 2 3 4 4 4 6 6 5 PULSARw/ pre-filtering -2 1.85x10 -2 1.85x10 -2 1.96x10 -2 7.25x10 -2 7.71x10 -2 3.83x10 -2 1.93x10 -2 1.45x10 AlphaPhase1.1 -1 1.32x10 -2 6.24x10 -2 5.71x10 * -2 7.50x10 -2 8.38x10 -2 8.22x10 -2 7.83x10 PercentageofHeterozygous GenotypesPhased PULSARw/ AlphaPhase1.1 pre-filtering 0.9370 0.8879 0.5207 0.5331 0.3327 0.3166 0.6569 * 0.9617 0.9026 0.8734 0.8155 0.9274 0.8295 0.8908 0.8393 *Notphasedsuccessfully Applicationtorealwholegenomesequencingdata Lastly,wesoughttoinvestigatetheperformanceofPULSARwhenappliedtoreal wholegenomesequencingdata.Todothisweusedwholegenomesequencinggenotypes forchromosome21fromtheSanAntonioMexicanAmericanFamilyStudy,focusingonthe samemultigenerationalpedigreesthatwehaveusedastemplatepedigreestructuresinour simulations.Thisdatasetcontainsallproblemsanderrorsinherentwithrealworlddata, includingungenotypedindividuals,somemissing(uncalled)genotypeswithinsequenced individuals,andgenotypingerrors.(Grosserrorsinpedigreerelationshipshadbeen previouslydetectedandcorrected,however,butsomelowlevelofkinshipbetween presumedunrelatedfoundersremainspossible.)Thedownsideofrealdataisthatthetrue state,heremainlyphasing,isunknown.Topermitamorecompletecharacterizationofthe performanceofthePULSARalgorithmontherealdata,wecreatedartificialblanksby maskinggenotypesrandomlyeachwithaprobabilityof1/1000,permittingustocompare 19 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. theoriginalgenotypecallsandtheimputedgenotypesafterphasing.Wethenapplied PULSARtothemanipulatedWGSgenotypes.Wereconstructedgenotypesfromthephased haplotypesproducedbyPULSAR,andthencomparedthemtotheoriginalgenotypes.Table 5summarizestheresultsforthisphasingandgenotypeimputationexperiment.PULSAR wasabletophasethevastmajorityofobservedgenotypes(>97%inallpedigrees,exceptin oneofthesparselygenotypedpedigrees).Andthegenotypesreconstitutedfromphased haplotypesmatchedtheobserved,unblankedgenotypeswell(>98.5%concordance).For themaskedgenotypes,theconcordanceratebetweentheimputedgenotypesandthe maskedgenotypesrangedfrom95.5%to98.6%acrosspedigrees,indicatingahighdegree ofphasingaccuracyandgenotypeimputation.However,onlyafairlysmallpercentageof themaskedgenotypeswereimputed(<50%).Therewerealsosomeactualmissing genotypesintheoriginalsequencingdata.Thereisanotabledifferencebetweenthe percentagesofmaskedgenotypesandtrulymissinggenotypesPULSARwasabletoimpute. Thisdifferenceislikelyduetothenatureofhowthemissinggenotypesaredistributed.The maskedgenotypesweredistributedrandomly,whereasthetrulymissinggenotypeswere morelikelytobefoundatthesamemarkerinmultipleindividuals.Thiscouldbeduetoany numberofreasons,withonereasonablehypothesisbeingthatindelsorcopynumber variantsdisruptthediploidstateofthesemarkersinasubsetofrelatedindividuals.In summary,PULSARphasedthevastmajorityoftheobservedgenotypesaccurately. 20 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. Table5.RealData Concordancerate between Percentage Pedigree genotypes ofnonSize Founders Generations reconstructed masked (sequenced) (sequenced) fromphasing genotypes resultsandnonphased maskedgenotypes 11(4) 14(3) 20(3) 28(16) 55(29) 78(31) 94(61) 5(1) 6(0) 6(0) 8(2) 17(3) 21(5) 28(9) 3 4 4 4 6 6 5 0.9948 1 0.9859 0.9899 0.9873 0.9943 0.9941 0.9769 0.9297 0.9950 0.9937 0.9885 0.9909 0.9858 Concordance ratebetween genotypes reconstructed fromphasing resultsand masked genotypes(ie. imputed genotypes) 0.9813 * 0.9545 0.9798 0.9689 0.9858 0.9823 Percentageof Percentage masked ofmissing genotypes genotypes phased/imputed imputed 0.1853 0 0.4221 0.4385 0.2450 0.3116 0.2866 0.0714 0 0.2025 0.3396 0.2074 0.2603 0.1437 *Nogenotypeswereimputed Computationtime Wesoughttoestimatethescalabilityofoursoftwaretolargerpedigrees.We simulatedpedigreeswithtwoparentsandarangeofchildrenbetween1and1000.This designallowsustoinvestigatetheeffectofincreasingtheIBDsharingwithinthepedigree. Itappearsthatthecomputationtimecanbemodeledwellusingasecondorderpolynomial basedonthenumberofindividualsinthepedigree.Weestimate,basedontheestimated polynomialfunction,thatapedigreewith2parentsand1000childrenwouldtakeroughly 5.5daystorunonaMacBookProlaptopwitha2.9GHzIntelCorei7processor.Wealso simulatedpedigreesinwhicheachgenerationconsistedofoneoffspringoftheprevious generationandonemarried-infounder.ThispedigreedesignwaschosenbecausetheIBD sharingwithinthepedigreedoesnotincreaseovergenerations.Wesimulatedpedigreesin therangeof2-50generationsinthismanner.Alinearmodelappearstobethebestfitto describetherelationshipbetweenthenumberofindividualsinthepedigreeandthetime PULSARtooktoanalyzethegenotypescreatedineachsimulation.Weestimatethata pedigreewith500generationsand999individualswouldtakeroughly88minutestorun. 21 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. Thus,takentogether,theseresultsshowthatbothpedigreesizeandpedigreestructure affectruntimeofthePULSARalgorithm,withaportionoftheincreaseinpedigreesize scalinglinearly. Intherunsonthesimulateddatainthelargestpedigree(94individuals), AlphaPhase(93seconds)wasfasterthanPULSAR(938seconds)onaserverwith2.40GHz IntelXeonCorei7CPUsrunningCentOSLinux7.Althoughcomputationtimesbecome importantwithincreasingvolumesofdata,wefoundPULSARtobesufficientlyfastfor practicaluseonwholegenomesequencinginthesizeofpedigreespresentedhere. Discussion Accuratephasinginformationisanessentialstepinmanygeneticstudies. Nonetheless,therearecurrentlyfewtoolsdesignedspecificallytophasegenotypesina broadrangeofpedigreesizesandstructures,withmostavailablemethodsworkingonlyin unrelatedindividualsorinnuclearfamilies.Inthebenchmarkingstudieswehave undertaken,PULSARoutperformedlong-rangephasing(usingthesoftwareAlphaPhase1.1), producedlowswitcherrorrates,andphasedahighpercentageofheterozygousvariants. Whilethemethodclearlyhasmerit,ithassomeshortcomingsandlimitations.Inour simulationswehaveexploredtheimpactofnon-sequencedindividualsandgenotype errors,bothofwhichareunavoidableinrealdatasetsandweakentheperformanceof PULSAR,aswellasalternativeapproachesforphasing.Fromoursimulationsofrealistic data,weexaminedsomepedigreeswithverysparsesequencing,suchas3outof20 individualsbeingsequenced.Evenintheseproblematicpedigrees,withmanymissing individuals,PULSARperformedfairlywell,erringonthesideofphasingfewerheterozygous markersratherthanphasingthemincorrectly.Thepercentageofallelesindividualsshare IBDwithatleastoneothersequencedindividualisthecriticalcomponenttopredictingthe 22 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. performanceofPULSAR.Iftheindividualssequencedwithinthepedigreearenotpredicted toshareahighpercentageofallelesIBDoneshouldconsideralternativephasing approaches,suchasthosedesignedforsingletons. Oursimulations,usingrealisticgenotypesandchromosomalhaplotypes,showthat PULSARisfairlyrobusttogenotypingerrors.InfactPULSARcanbeusedtocorrectsome genotypingerrors.PULSAR“fixes”incorrectgenotypesusingarudimentarymethod,a simplemajoritiesvoteamongindividualssharingahaplotype.Aweightingschemebasedon theconfidenceofgenotypecallsorbasedonthenumberofreadssupportingagivenallele callmayimprovePULSAR’sabilitytofixgenotypeerrors. Wehavenotinvestigatedtheimpactoferrorsinthepedigreestructuresthemselves onPULSAR(includingtheexistenceofunknownrelationshipsbetweenpedigreefounders). Weadvocatethatpedigreerelationshipsareconfirmedorestimatedanalytically(Sunetal. 2002;SunandDimitromanolakis2014)beforeoneembarksoneffortstoestablishphase, correctgenotypingerrors,orimputemissinggenotypes. PULSARwasdesignedspecificallytoworkonwholegenomesequencedata,based ontherationalethatthevastnumberofraresequencevariantsinthehumangenomewould produceahighdensityofLSAsinpedigrees.Whilewholegenomesequencingclearlyisthe preferredwaytocharacterizethegenome,atthepresenttimemanystudiesonlyinvolve exomesequencingdataordenseSNPgenotyping,mainlyduetocost.Wehavenot investigatedtheperformanceofPULSARontheseothergenotypingplatforms.SNP genotypingpanelsarebiasedtowardscommonSNPs.Forthatreason,weanticipatethat PULSARwouldbeoflimitedutilityforsuchdatainextendedpedigrees,whereonlyvery fewcommonSNPswouldbeintroducedonlyonceintoagivenpedigreeandthusserveas LSAstothealgorithm.Onsmallpedigreeswithrelativelyfewfoundersweexpectthatthe PULSARalgorithmwouldbemuchlessimpactedandworkquitewell. 23 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. ThereareanumberofpotentialimprovementsandextensionstoPULSARthatwe couldpresent,butthesearebeyondthescopeofthismanuscript.However,wedidexplore theutilityofpre-screeningvariantsupfrontbasedonallelefrequencywhenidentifying putativeLSAs.Whenpedigreefoundersarenotavailableforsequencing,whichwilloftenbe thecaseinreality,thensuchfilteringbasedonminorallelefrequencywasshowntobea usefulprocedureforreducingthenumberoffalsepositivesamongputativeLSAsandthus improvetheperformanceofthemethod.However,allelefrequencyestimatesshouldbe reliabletobeeffectiveaspre-filters.Theycanbetakeneitherfromthestudyathand,if therearesufficientlymanyfounders,ortakenfromreferencepanels.Ifonetakesthe estimatesfromreferencepanels,oneshouldtakecarethattheethnicitiesbeidentical(oras closetoitaspossible),andthatthesequencingqualitybehighand,ideally,basedonthe sametechnologyplatformandmethods. Thealgorithmthatwehavepresentedisarules-basedprocedure,ratherthana maximumlikelihoodapproach.Conceptually,therearemanymethodsthatcanbeusedto inferphaseinpedigreesusingmaximumlikelihood.Forexample,onecouldusetheElstonStewartalgorithm(ElstonandStewart1971),asimplementedinLINKAGE(Lathropetal. 1984)orFASTLINK(Cottinghametal.1993),toinferphase.Thepracticaldifficultyisthat theElston-Stewartalgorithmisverylimitedinthenumberofvariantsthatcanbejointly analyzed,whichwouldmakeitnecessarytobreakthegenomeupintoavastnumberof overlappingsegments,whosephasedhaplotypedatawouldthenhavetobepatched together.Alternatively,theLander-Greenalgorithm(LanderandGreen1987),as implementedinthesoftwarepackageMERLIN(Abecasisetal.2002),cananalyzemany variantssimultaneously,butarelimitedinthenumberofindividualsthatcanbehandled. Forlargepedigrees,thiswouldthennecessitatebreakingthemintooverlapping subcomponents,followedbyre-combiningthephasedgenotypes.Thesearenottrivial 24 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. tasks,sincejoiningmarkersegmentsofpedigreefragmentsmayleadtoMendelian inconsistenciesandotherproblems.ToovercomesomeofthelimitationsoftheElstonStewartalgorithmandtheLander-Greenalgorithm,anumberofMonteCarloMarkovChain methodshavebeendeveloped,includingLOKI(Heath1997)andSimWalk2(SobelandLange 1996).However,thesemethodsareextremelycomputationallyintensiveandarenot readilyapplicableonwholegenomesequencedata.Allofthesemethodsarealsoimpacted bygenotypingerrorsandmissinggenotypedatatovaryingdegrees,andarenopanacea. Therearealsoanumberofmethodsforphasingofsingletonindividuals(Browning andBrowning2011),whichareoftenbasedonhiddenMarkovmodels,suchas BEAGLE(BrowningandBrowning2007).Currently,themostprominentphasingmethodfor singletonindividualsarevariantsoftheapproximatecoalescentmodels,suchas SHAPEIT(Delaneauetal.2011)andMACH(Lietal.2010).Theiraccuracyandcomputational speedhasgreatlyimprovedrecently,aidedbytheavailabilityofever-largerreference panelsforsomeofthemajorethnicgroups.Theproblemswithapplyingthesemethodsto pedigreesisthatsuitablereferencepanelsarenotpresentlyavailableinmanysmaller populations,whichareparticularlysuitableforpedigreestudies.Withoutlargereference panels,statisticalphasingmethodsperformmuchworse.Inaddition,treatingfamily membersasunrelatedindividualsduringphasingwilloftenresultinphaseddatathatis inconsistentwithMendelianrulesofinheritance.Aspartofourresearchforthispaper,we haveinvestigatedtheutilityofduoHMM(O'Connelletal.2014),whichisimplemented withintheprogramSHAPEIT2(Delaneauetal.2011),whichseekstocorrecthaplotypes producedbystatisticalphasingusingtherestrictionsoninheritanceobservedinduos. However,whenappliedtothesimulationsdescribedinthisstudythismethodfrequently failedtorunonmostofthesimulatedpedigreestructures,andthuswechosetoexcludeit fromthebenchmarkingresultsthatwerepresented. 25 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. Giventhedifferentweaknessesandlimitationsofthevariousalternativephasing methodswhenappliedtopedigreedata,thereappearstobeaneedforarationalmethodof combiningphaseresultsfrommultiplemethods,especiallythosewithdifferentpremises. Resultsfromacombinationofmethodsmayprovetobemoreaccurateandcomplete,yet alsopractical.Forexample,thephasingoutputfromPULSARcouldbeusedasinputfor MonteCarloMarkovChainmethods,providinganinitialplausiblephasedgenotypingstate thatcouldthenberefinedbyMCMC.Ortheoutputfromstatisticalphasingforsingletons maybetakenasinputforpedigree-basedphasingmethods.Inthisway,onecouldharness thehighaccuracyofmethodsthatdirectlyobserveinheritancewithinpedigreeswith statisticalphasingmethodsthatarecapableofinferringphaseinsegmentsofthegenome whereinheritanceisnotdirectlyobservablegiventheindividualsthataresequenced. Herewehavepresentedanovelalgorithm,PULSAR,withassociatedsoftware,for phasingWGSgenotypesinabroadrangeofpedigreesizesandstructures.Thehigh accuracyofthehaplotypesproducedbythePULSARalgorithm,whichoftenaccuratelyspan entirechromosomes,ispromising.Theapproachmayalsobeausefultoolforgenotype errorcheckingandcorrection.Basedonevidencefromourbenchmarkingweconcludethat PULSARisasuitableandpracticalphasingapproachacrossabroadrangeofpedigree structureswhenareasonableproportionofthepedigreemembersaresequenced. Methods Pedigreesusedforsimulationstudies WechosetoinvestigatethephasingandimputationaccuracyofPULSARvia simulationinavarietyofpedigreesizesandstructures.Nuclearpedigreestructureswere composedoftwoparentsandarangeof1to7children.Asavariancereductiontechnique inoursimulations,eachsmallerpedigreewasgeneratedasasubsetofthelargerpedigrees. 26 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. Inotherwords,thepedigreewith1childisasubsetofthepedigreewith2children,andso on.Sevenlarger,multigenerationalpedigreestructureswerechosenfromtheSanAntonio MexicanAmericanFamilyStudies(Mitchelletal.1996;Huntetal.2005)(SAMAFS),having 11,14,20,28,55,78and94individualscomprising3,4,4,4,6,6and5generations,respectively. Inthesepedigrees,4,3,3,16,29,31,and61individualsaresequenced,respectively.Diagrams ofthesepedigrees,generatedusingCranefoot(Makinenetal.2005),areincludedinthe supplementarymaterials.Altogether,14differentpedigreestructureswerechosento representabroadrangeofsizesandstructuresinordertoinvestigatetheaccuracyand completenessofthephasingresults.Twoadditionalpedigreestructuresweresimulatedfor thepurposeofinvestigatingcomputationalscalabilityofthesoftware:pedigreeswithtwo parentsandarangebetween1and1000children,andpedigreesinwhicheachgeneration consistedofoneoffspringofthepreviousgenerationandonemarried-infounderfor2-50 generations. Simulationofwholegenomesequencingdata Wholegenomesequencingdatafor84maleXchromosomes(excludingthe pseudoautosomalregions)fromBritish(GBR)andFinish(FIN)populationsfromthe1000 GenomesProject(GenomesProjectetal.2015)wereusedasthesetofpotentialfounder chromosomesinthesimulationstudies.UtilizingrealmaleXchromosomeshasthe advantagethatchromosomalhaplotypesareknownwhilealsomaintainingthecomplexities ofrealwholegenomesequencingdata,suchastheminorallelefrequencydistributionof variants,linkagedisequilibrium,andtheexistenceofsmallIBDsegmentssharedbetween distantlyrelatedindividuals(BrowningandBrowning2007),thoughtheXchromosome maydifferslightlyinthesecharacteristicsfromtheautosomes.Afterfilteringout pseudoautosomalregionsandlociwithmorethan2alleles,therewereatotalof318,912 27 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. polymorphicvariantsintheseeddataset.Theseseedchromosomeswereshortenedin simulationto200,000variantsinordertoaccommodatelimitationsinthesoftwarepackage AlphaPhase1.1(Hickeyetal.2011),byconsideringasmallersectionofthechromosome. Alphaphase1.1wasusedforbenchmarkingbecauseitimplementsthe“longrangephasing” approachinpedigrees(seebelow),althoughitwasnotdesignedforphasingWGSdata. Withineachpedigree,founderchromosomesweresampledfromthese84chromosomes randomlywithoutreplacement,whichassumesthatthepedigreefoundersrepresenta randompopulationsample.Inheritanceofhaplotypeswithinthepedigreeswassimulated by“genedropping”,utilizingaprobabilityofrecombinationbetweenvariantsthat correspondsto1recombinationevery100Mb,aroughestimateoftheactual recombinationrateinhumans.Chromosomesnotusedaspedigreefounderchromosomes wereusedasareferencepanelfromwhichtocalculateminorallelefrequency.Insummary, forindividualsinthesepedigreeswesimulatedgenotypesandchromosomalhaplotypes withmanyofthecharacteristicsofrealdata. Wholegenomesequencingdata Wholegenomesequencinghasbeengeneratedformanyparticipantsinthe SAMAFS.SNV and di-allelic INDELs with at least 5 observations of the minor allele were homogenized and merged from vendor provided (Complete Genomics, Illumina) sequencing genotype calls from 2330 directly sequenced genomes, resulting in 27,160,796 genetic variants. ThisdataisavailablethroughdbGaPStudyAccession:phs000462.v1.p1.Chromosome 21 (356,545 variants) was selected for analysis. Individuals and variants with genotyping rates below 99% were excluded, resulting in 354,466 variants. Wechosetofocusonthesame7pedigrees oftheSAMAFSforbothrealandsimulateddatasetsinordertomaximizethecomparability ofthereal-worlddatatothesimulateddata.The7pedigreesarecomprisedofthree hundred 28 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. individuals, of which 147 are sequenced. Benchmarking WesetouttobenchmarkPULSARagainstlongrangephasing(LRP)(Kongetal. 2008),arguablythecurrentbestapproachforphasinggenotypesinmultigenerational pedigrees(BrowningandBrowning2011).AlphaPhase1.1(Hickeyetal.2011)isan implementationoftheLRPalgorithmpioneeredbyKongetal(Kongetal.2008),which phaseshaplotypesusingtheassumptionthatthegenotypedataofindividualssharinga haplotypeIBDmustshareatleastonealleleidentical-by-state(IBS)ateachlocusinthe sharedregion.Functionally,themethodsearchesforopposinghomozygousalleles,which excludestwoindividualsfromsharingasegmentIBDatthatgenomiclocation(assumingno mutationorgenotypingerror).LRPishighlyaccuratewhenatleastoneindividualsharing theIBDsegmentishomozygousforagivenvariant. Weutilizedthreemetricsfortheperformancecomparisoninthesimulations: switcherrorrate(Linetal.2002)(SER)toassessphasingaccuracy,theproportionof heterozygousgenotypesthatarephased,andthe‘time’functionintheLinuxenvironment toassesstheruntimeofthesoftware.SERmeasurestherateinwhichadjacent heterozygousgenotypesarephasedincorrectlywithrespecttoeachother. DataAccess PULSARisavailableathttps://github.com/AugustBlackburn/PULSAR_1.0. Acknowledgements WewouldliketothanktheparticipantsoftheSAMAFS.TheSAMAFSwholegenome sequencedatawereobtainedaspartoftheT2D-GENESConsortium,whichissupportedby 29 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. NIHgrantsU01DK085524,U01DK085584,U01DK085501,U01DK085526,andU01 DK085545.TheSanAntonioFamilyHeartStudyandSanAntonioFamily Diabetes/GallbladderStudyprovidedpedigreecollectiondata,whicharesupportedbyNIH grantsR01HL0113323,P01HL045222,R01DK047482,andR01DK053889.Thiswork waspartiallyfundedbyNIHgrantDK099051. DisclosureDeclaration Theauthorsreportnoconflictinginterests. Workscited AbecasisGR,ChernySS,CooksonWO,CardonLR.2002.Merlin--rapidanalysisof densegeneticmapsusingsparsegeneflowtrees.NatGenet30:97-101. BrowningSR,BrowningBL.2007.Rapidandaccuratehaplotypephasingand missing-datainferenceforwhole-genomeassociationstudiesbyuseof localizedhaplotypeclustering.AmJHumGenet81:1084-1097. BrowningSR,BrowningBL.2011.Haplotypephasing:existingmethodsandnew developments.NatRevGenet12:703-714. CottinghamRW,Jr.,IduryRM,SchafferAA.1993.Fastersequentialgeneticlinkage computations.AmJHumGenet53:252-263. DelaneauO,MarchiniJ,ZaguryJF.2011.Alinearcomplexityphasingmethodfor thousandsofgenomes.NatMethods9:179-181. ElstonRC,StewartJ.1971.Ageneralmodelforthegeneticanalysisofpedigreedata. HumHered21:523-542. GenomesProjectC,AutonA,BrooksLD,DurbinRM,GarrisonEP,KangHM,Korbel JO,MarchiniJL,McCarthyS,McVeanGAetal.2015.Aglobalreferencefor humangeneticvariation.Nature526:68-74. HeathSC.1997.MarkovchainMonteCarlosegregationandlinkageanalysisfor oligogenicmodels.AmJHumGenet61:748-760. HickeyJM,KinghornBP,TierB,WilsonJF,DunstanN,vanderWerfJH.2011.A combinedlong-rangephasingandlonghaplotypeimputationmethodto imputephaseforSNPgenotypes.GenetSelEvol43:12. HuntKJ,LehmanDM,AryaR,FowlerS,LeachRJ,GoringHH,AlmasyL,BlangeroJ, DyerTD,DuggiralaRetal.2005.Genome-widelinkageanalysesoftype2 diabetesinMexicanAmericans:theSanAntonioFamilyDiabetes/Gallbladder Study.Diabetes54:2655-2662. KongA,MassonG,FriggeML,GylfasonA,ZusmanovichP,ThorleifssonG,OlasonPI, IngasonA,SteinbergS,RafnarTetal.2008.Detectionofsharingbydescent, long-rangephasingandhaplotypeimputation.NatGenet40:1068-1075. 30 bioRxiv preprint first posted online Jun. 9, 2017; doi: http://dx.doi.org/10.1101/148510. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission. LanderES,GreenP.1987.Constructionofmultilocusgeneticlinkagemapsin humans.ProcNatlAcadSciUSA84:2363-2367. LathropGM,LalouelJM,JulierC,OttJ.1984.Strategiesformultilocuslinkage analysisinhumans.ProcNatlAcadSciUSA81:3443-3446. LiY,WillerCJ,DingJ,ScheetP,AbecasisGR.2010.MaCH:usingsequenceand genotypedatatoestimatehaplotypesandunobservedgenotypes.Genet Epidemiol34:816-834. LinS,CutlerDJ,ZwickME,ChakravartiA.2002.Haplotypeinferenceinrandom populationsamples.AmJHumGenet71:1129-1137. MakinenVP,ParkkonenM,WessmanM,GroopPH,KanninenT,KaskiK.2005.Highthroughputpedigreedrawing.EurJHumGenet13:987-989. MitchellBD,KammererCM,BlangeroJ,MahaneyMC,RainwaterDL,DykeB,Hixson JE,HenkelRD,SharpRM,ComuzzieAGetal.1996.Geneticand environmentalcontributionstocardiovascularriskfactorsinMexican Americans.TheSanAntonioFamilyHeartStudy.Circulation94:2159-2170. O'ConnellJ,GurdasaniD,DelaneauO,PirastuN,UliviS,CoccaM,TragliaM,HuangJ, HuffmanJE,RudanIetal.2014.Ageneralapproachforhaplotypephasing acrossthefullspectrumofrelatedness.PLoSGenet10:e1004234. SobelE,LangeK.1996.Descentgraphsinpedigreeanalysis:applicationsto haplotyping,locationscores,andmarker-sharingstatistics.AmJHumGenet 58:1323-1337. SunL,DimitromanolakisA.2014.PREST-plusidentifiespedigreeerrorsandcryptic relatednessintheGAW18sampleusinggenome-wideSNPdata.BMCProc8: S23. SunL,WilderK,McPeekMS.2002.Enhancedpedigreeerrordetection.HumHered 54:99-110. TewheyR,BansalV,TorkamaniA,TopolEJ,SchorkNJ.2011.Theimportanceof phaseinformationforhumangenomics.NatRevGenet12:215-223. 31