bioRxiv preprint first posted online Jan. 14, 2016; doi: http://dx.doi.org/10.1101/036681. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.

Searching more genomic sequence with less memory for fast and accurate metagenomic profiling

Shea N. Gardner1,3, Sasha K. Ames2,3, Maya B. Gokhale2,3, Tom R. Slezak1,3, Jonathan E. Allen1,3+

1 Global Security Computer Applications Division
2 Center for Applied Scientific Computing
3 Lawrence Livermore National Laboratory, Livermore CA
+ Address for correspondence: [email protected]

Abstract

Software for rapid, accurate, and comprehensive microbial profiling of metagenomic sequence data on a desktop will play an important role in large scale clinical use of metagenomic data. Here we describe LMAT-ML (Livermore Metagenomics Analysis Toolkit - Marker Library), which can be run with 24 GB of DRAM memory, an amount available on many clusters, or with 16 GB DRAM plus a 24 GB low cost commodity flash drive (NVRAM), a cost effective alternative for desktop or laptop users. We compared results from LMAT with five other rapid, low-memory tools for metagenome analysis for 131 Human Microbiome Project samples, and assessed discordant calls with BLAST. All the tools except LMAT-ML reported overly specific or incorrect species and strain resolution of reads that were in fact much more widely conserved across species, genera, and even families. Several of the tools misclassified reads from synthetic or vector sequence as microbial, or human reads as viral. We attribute the high numbers of false positive and false negative calls to a limited reference database with inadequate representation of known diversity. Our comparisons with real world samples show that LMAT-ML is the only tool tested that classifies the majority of reads, and does so with high accuracy.

Introduction

Recent studies show that the microbiome plays an important role in the health of humans, animals, and natural and agricultural systems.[1-4] Metagenomic sequencing of human microbiomes has already contributed to diagnosing and treating sick patients [5], and is poised to play a much larger role, provided that the technique can deliver accurate and timely analysis of multi-gigabases of unassembled reads. Metagenomic analysis typically demands substantial computing resources, either in terms of CPU or memory, or both, and run times can exceed the time for sequencing.[6] As institutions invest in sequencing infrastructure, they may not have a parallel capability to invest in and maintain large compute clusters, and issues of patient privacy or data transfer bottlenecks may discourage cloud or centralized analysis. For large data sets, running BLAST analysis on Amazon's EC2 cloud was several times more expensive than the sequencing itself, and costs of sequencing are declining faster than those of computing.[7] Rapid, sensitive, and accurate methods of taxonomic classification of the sample contents that can run on relatively low-priced desktop machines promise a solution.

To achieve the goal of fast, accurate metagenomic analysis, various metagenome analysis software packages reduce the original sequence database to a smaller, more easily searchable marker library containing a taxonomically informative subset. Metaphlan2 [8] matches reads to a small set of marker genes, single-copy genes present in many bacteria, or clade-specific genes to do taxonomic classification
and abundance estimation without attempting to classify all reads. Kraken [9] matches the k-mers (k=31) in a read to those in a reference database. It pre-computes the lowest common ancestor (LCA) of reference sequences containing each k-mer, and applies a tree traversal algorithm to taxonomically label a read from its k-mers. The full Kraken database, however, does not fit in memory on common desktop computing systems. The MiniKraken database was created as a less memory-intensive alternative for running on desktop systems. Clinical Pathoscope [10] aims to identify pathogens in clinical samples by alignment to NCBI bacterial and viral reference genomes and Bayesian statistical confidence estimation. GOTTCHA [11] maps reads to a pre-computed database of unique subsequences at multiple taxonomic ranks (family, genus, species, strain, etc.). SIANN [12] also maps reads to a precalculated database of species- and strain-specific regions of pathogens and their near neighbors. Unlike the other methods mentioned here, SIANN was designed for the specific task of rapidly assessing whether any members of a defined set of pathogens are present in a metagenomics sample. It is not a general-purpose metagenomics tool. SURPI is another recent pathogen detection system, which uses an approach similar to Clinical Pathoscope by mapping reads to reference genomes for organism identification, but it requires more memory (60 GB) than is available on a typical desktop.[5]

LMAT uses a reference genome database that contains both draft and finished genomes from bacteria, archaea, viruses, and some eukaryotes including pathogenic protozoa. This set of reference genomes is more than 11-fold larger than any other metagenome analysis reference sequence database.[13,14] LMAT indexes every k-mer (k=20) in a reference genome database with all of the sequences containing that k-mer. It implements a "pruning" strategy to retain only higher level taxonomic labels for k-mers shared by more than some pre-specified number of sequences down a taxonomic branch. It still retains multiple taxonomic nodes and genomes per k-mer, and thus more complete information about which sequences contain a k-mer than is possible with an LCA approach, which stores only a single taxonomic
node per k-mer. This allows higher resolution (e.g. species and strain) calls when the data warrants. The full database (LMAT Grand) requires 500 GB of DRAM or flash memory, so it is not feasible for "desktop" or typical cluster users. LMAT-Grand's extensive representation of genomic variation leads to labeling a large fraction of reads, which is useful for some applications (such as read binning for assembly) but is more computationally costly than may be needed for organism identification. The LMAT Marker Library (ML) reduces the RAM requirements by preselecting only the most taxonomically informative and non-overlapping (i.e. non-redundant) 20-mers for indexing, and by imposing more stringent pruning. Thus, memory requirements of LMAT-ML are reduced not by limiting the taxonomic coverage and strain resolution of the reference database, but by pre-selecting the subset of k-mers with the highest taxonomic information content. Moreover, the marker library approach has the potential to run at least an order of magnitude faster by correctly "ignoring" the less taxonomically informative portions of the query set. Part of the work we present here evaluated several LMAT-ML pruning levels to determine the optimal balance between memory, speed, and consistency of results compared to LMAT Grand. We also compared an LMAT-ML that contained only microbial k-mers to LMAT-ML+H, which also contained all the human k-mers in LMAT Grand.

Each of the above methods makes different tradeoffs in terms of memory, speed, sensitivity, and accuracy. We chose to compare these marker library methods using actual microbiome data for several reasons. Previous studies with simulated data sets ensure that the simulated reads come from organisms or close relatives present in the reference database, which cannot be assumed in real samples. Constructing realistic, robust simulated metagenomic benchmarking data sets remains a fundamental challenge and will benefit from emerging community efforts to construct resources available for third party validation. While these resources develop, our goal was to compare tools that could be run with 16 GB of RAM or less on an important target subset of metagenomics, human metagenomics, using 131 Human Microbiome Project (HMP; http://hmpdacc.org/HMASM/) samples randomly selected to span body sites and genders, and to check discordant results between methods with BLAST. These samples may contain many species not included in NCBI RefSeq (http://www.ncbi.nlm.nih.gov/refseq/), or represented only as draft genomes, as well as some that have not yet been sequenced. Although we do not know ground truth, we looked at cases of disagreement between the LMAT-Grand method, which queries the most comprehensive reference database, and each of the marker library based methods. We examined the reads responsible for discordant species calls using BLAST [15] searches to assess whether they were most likely false negatives for one method or false positives for the other. We report on the speed and accuracy of these metagenome analysis tools for HMP samples. Our evaluation considers a hardware configuration of 16 GB DRAM and a low cost commodity NVRAM in the form of a solid-state drive (SSD), which should be accessible for use with existing desktop computers. The LMAT ML databases range from approximately 13 GB to 19 GB in required storage; thus, this range covers both fitting in and exceeding the available DRAM. This range allows us to measure the impact of database size on LMAT classification performance.

Materials and Methods

Building LMAT-ML

The LMAT reference genome database includes 1) eukaryotic sequence of fungi, protozoa, and some multicellular organisms (from organelles labeled as whole genome, e.g. mitochondria and chloroplasts); 2) draft genomes and assembled contigs from unfinished Whole Genome Shotgun (WGS) genome sequencing projects (ftp://ftp.ncbi.nih.gov/genbank/wgs/); and 3) draft and finished bacteria, virus, archaea, fungi, and protozoa genomes from a number of sequencing centers worldwide with publicly available sequence data in addition to those from NCBI (footnote 1). An extensive collection of artificial vector sequence [17] is also included to filter out contaminating sequence in draft assemblies. The fungi and protozoa sequence data came from the Fungi and Protist Group BioProjects reported in the NCBI eukaryotes genome report (ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/eukaryotes.txt), and BioProject sequences were extracted by the assembly accession.

Footnote 1: Sanger Center, J. Craig Venter Institute, Baylor College of Medicine, Washington University in St. Louis, Beijing Genome Institute, Integrated Microbial Genomes, European Molecular Biology Laboratory, etc.

To select the k-mers for inclusion in LMAT-ML, we used the procedure described in [14], using k=20 instead of 18, and adding an extra step to remove overlapping k-mers relative to a reference sequence (randomly selected from those containing a series of adjacent k-mers). Briefly, the objective was to identify a collection of k-mers that are uniquely associated with phylogenetically distinct sets of genomic sequence. Groups of genomes are defined by their shared k-mers, with a minimum of 200 k-mers shared within a group for viral genomes and 1000 k-mers shared within a non-viral group. The minimum thresholds were set to maintain groups of genomes that retain some degree of phylogenetic relatedness. Any k-mer found in more than one group was eliminated to yield a set of k-mers that are uniquely associated with different levels of the taxonomy hierarchy. Additionally, k-mers matching RepBase 18.06 [16] were eliminated. Since the resulting k-mer set still yielded a database with a larger memory footprint than would be practical for a desktop or laptop, k-mers were mapped to a randomly selected representative sequence from the genome group to remove multiple adjacent overlapping k-mers. Next, synthetic k-mers from LMAT-Grand were added to detect synthetic/vector sequences. Finally, a "+H" version of the LMAT-MLs was created to use the human k-mers from LMAT-Grand to accurately classify human host sequences, while maintaining small memory requirements.

We evaluated two taxonomy pruning strategies to improve speed and reduce the memory requirements, as described in detail in [13]. Evaluating these pruning strategies is necessary to assess the possible cost in lost accuracy while improving speed. Such an evaluation is an important step in understanding the potential for LMAT-ML, since the pruning strategy is a unique property of LMAT's approach to classification, in contrast to other tools. As a baseline, when no pruning is used, indicated as "-All", every taxonomy identifier of lowest available rank (e.g. species or strain) for a given k-mer is retained in the searchable database. The -Min pruning option (LMAT-ML-Min and LMAT-ML+H-Min) stores only the lowest common ancestor (LCA) for each k-mer, similar to the approach employed by Kraken.

The alternative pruning option tested stores a maximum of 10 taxonomy identifiers per k-mer (LMAT-ML and LMAT-ML+H). LMAT databases with the "+H" label include all human k-mers in addition to the microbial k-mers. In this option each k-mer is linked to a set of taxonomic identifiers that contains the lowest common ancestor for all sequences containing the k-mer and up to 9 descendant identifiers, which must retain a common rank. For example, if the LCA is of rank genus, and the k-mer is found in nine or fewer distinct species, then all distinct species identifiers would be retained. If the k-mer is found in more than nine species (but only one genus) then only the genus identifier would be retained.
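
The three labelling options described above (-Min, the 10-identifier cap, and -All) can be illustrated with a small sketch. This is not LMAT's implementation: the toy taxonomy, the function names, and the use of `max_ids=None` to mean "-All" are assumptions made here for illustration, and the input identifiers are assumed to share a common rank already.

```python
# Illustrative sketch only; names and the toy taxonomy are hypothetical.
# PARENT maps each node to its parent; "root" has no parent.
PARENT = {f"sp{i}": "genusA" for i in range(1, 13)}
PARENT["genusA"] = "root"


def lineage(taxid, parent):
    """Return the path from taxid up to the root, leaf first."""
    path = [taxid]
    while taxid in parent:
        taxid = parent[taxid]
        path.append(taxid)
    return path


def lowest_common_ancestor(taxids, parent):
    """Deepest node shared by the lineages of all taxids (single-rooted tree)."""
    taxids = list(taxids)
    common = lineage(taxids[0], parent)
    for t in taxids[1:]:
        other = set(lineage(t, parent))
        common = [node for node in common if node in other]
    return common[0]


def prune_labels(taxids, parent, max_ids=10):
    """Labels kept for one k-mer.

    max_ids=1    -> only the LCA (the "-Min" option)
    max_ids=10   -> LCA plus up to 9 descendant identifiers, else LCA only
    max_ids=None -> every identifier of lowest rank (the "-All" option)
    """
    distinct = sorted(set(taxids))
    if len(distinct) == 1 or max_ids is None:
        return distinct
    anc = lowest_common_ancestor(distinct, parent)
    if len(distinct) <= max_ids - 1:
        return [anc] + distinct
    return [anc]


few = ["sp1", "sp2", "sp3"]
many = [f"sp{i}" for i in range(1, 11)]  # ten species, one genus
print(prune_labels(few, PARENT))          # genus plus the three species
print(prune_labels(many, PARENT))         # genus only
print(prune_labels(few, PARENT, max_ids=1))
```

With the cap of 10, a k-mer seen in three species keeps the genus and all three species identifiers, while a k-mer seen in ten species of the same genus collapses to the genus alone, matching the example in the text.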

The LMAT database uses a two-level index data structure described in detail in [19] to improve the efficiency of k-mer search. A k-mer is represented by two non-overlapping bit vectors, with a 20-mer represented by 40 bits, the first N bits stored in the first level of the index, and the remaining 40-N lower-order bits stored in the second level. The choice of N was optimized for use with the ML. A split of N=25 bits was selected, which reduced the size of the ML database by 1.5 GB compared with previous settings developed for use with the full database. A new extension to the LMAT software was added (v 1.2.5), enabling the two-level split to be specified for the target database. This feature allows each database to be tuned for efficient use of space according to the size of the database. The split parameter was adjusted from the previous setting used for the larger database (LMAT-Grand) to deploy the smaller marker databases on a low cost SSD device.

Marker library comparisons

We used a set of 131 HMP samples randomly chosen to span all body sites for both genders (Additional file 1: 131HMP_Samples.xlsx), with preprocessing to trim non-biological portions of reads (adaptors), trim or replace low quality bases (Q<10) with N, and combine paired end sequences as described in [14]. We compared results from Clinical Pathoscope v1.0.3, Metaphlan2 (db_v20), GOTTCHA (downloaded Sept. 2, 2014), MiniKraken (kraken-0.10.4-beta), SIANN (v1.12), the LMAT-ML+H, and the LMAT Grand database. Each method was run with default parameters. Clinical Pathoscope target databases were bacteria and virus, and the host filtration database was human. GOTTCHA was run against the GOTTCHA_BACTERIA_c3514_k24_u24_xHUMAN3x.species and GOTTCHA_VIRUSES_c3498_k85_u24_xHUMAN3x.species databases, microbial databases in which 24-mers matching the human reference genome had been removed, and results were combined for bacteria and viruses. Results from all LMAT-MLs with no or moderate pruning were nearly identical in terms of consistency with the LMAT-Grand database, while results were slightly worse for LMAT-ML-Min without human k-mers (Figure S1), so for all the marker method comparisons we used LMAT-ML+H. We uniformly applied the default cutoff values of 0.5 for LMAT-Grand and 0.2 for LMAT-ML+H. We required a minimum of 100 reads per species to call that species as present for each of the methods, and observed that results were similar across thresholds of 50-2000 reads (results not shown). For methods that reported calls as percentages of mapped or total reads, the percentages were converted to absolute numbers of reads. For Metaphlan2, since the relative abundance was not straightforward to convert to read counts, we used the percent abundance multiplied by the total number of reads mapped as a proxy. We identified the NCBI species-level taxonomy id and NCBI species name for all species and strain calls by each method, a nontrivial process since some methods report non-standard species names and no taxonomy identifiers, and use deprecated GI numbers, which are not in the current NCBI databases.

For each method, the 10 most commonly called species detected by that method and not by LMAT-Grand were identified, and the reads identified by that method as belonging to that species were extracted based on parsing the bowtie2 or SAM output, identifying the taxonomy by GI number, NCBI gene ID (Metaphlan2), or as a last resort by organism name, using NCBI tables (footnote 2). We acknowledge that we may have missed extracting some reads which a given method used for classification, since these methods do not describe standardized procedures for read extraction and call validation. For Metaphlan2 in particular, we were unable to extract as many reads as we expected based on reported percentage abundances. Reads were combined into a single file per species for each method. These reads were then compared using BLAST (blastn -evalue 0.0001 -max_target_seqs 5) to our comprehensive database of all bacterial and viral sequences, and the output pruned to show only the matches to each read with the lowest e-values, allowing multiple matches per read with the same lowest e-value. These reads were also compared using BLAST to a comprehensive database of vectors and other artificial sequences, containing the sequences in UniVec as well as Illumina adaptors and a
number of commercially available vector sequences [17]. The reads were also compared using LMAT with the Grand database, and the taxonomic call with highest read count was reported.

The 10 most common species calls made by LMAT-Grand and not another method were gathered for each method, ignoring Homo sapiens and the LMAT-Grand classification for synthetic constructs, which LMAT-Grand reports as a "species". The top 10 list was identical for Clinical Pathoscope, GOTTCHA, and MiniKraken, which all use NCBI RefSeq to build their reference databases. Reads for these species with LMAT-Grand scores of at least 1 were extracted and compared using BLAST against our comprehensive viral/bacterial genome database, and in one case where a protozoa was uniquely detected, against our protozoa database.

Footnote 2: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz, ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz

Results

LMAT databases encode 11-147 times more sequence data representing approximately 2-4 times more species than other methods (Table 1). The memory (DRAM or DRAM+NVRAM) of LMAT-ML is also higher than for other methods, although still feasible for a desktop, particularly when supplemented with low cost NVRAM.

Method               Number of species in    Reference genome database    Memory required
                     reference databaseⱡ     size (Gbases)                (GBytes)
Clinical Pathoscope  4280                    7.8                          4.6†
Metaphlan2           7147                    0.76                         3†
GOTTCHA              3461                    Not available*               8®
MiniKraken           5538                    9.97                         4¥
LMAT-ML              12632                   112.10**                     16 + 24 NVRAM or 24 DRAM

Table 1: Reference genome database size and number of species represented by each method.
ⱡ Species counts determined by counting the unique NCBI species identifiers from all sequences in the reference database.
* Original reference sequence database was not provided with download, only the precomputed, compiled marker library of k-mers.
** This does not include the additional k-mers added from the 1000 Human Genomes Project, since these were from unassembled genomes as described in [13].
† Estimated using /usr/bin/time, correcting for the bug in GNU time (http://stackoverflow.com/questions/10035232/maximum-resident-set-size-does-not-make-sense) by dividing the maximum resident set size by 4.
¥ Reported by [9]
® Reported by https://github.com/poeli/GOTTCHA

Runtime performance is determined by database size

Table 2 shows the size of the different ML databases reflecting the different pruning strategies (shown as 1, 10, and All in the third column of the table). Although there was little difference in taxonomic classifications between 4 of the 5 LMAT-ML (Figure S1), the choice of pruning level affects memory requirements. For comparison, LMAT-MLs include approximately 4.4-4.7 times more k-mers than MiniKraken but require 3-4.7 times more memory. However, LMAT-ML-Min uses fewer bytes per k-mer than MiniKraken (7.6 versus 11.2), which is likely explained by LMAT's use of a smaller k (20 versus 32 for MiniKraken).

Method          Contains human   Maximum taxonomy ids   Size (GB)   K-mers
                k-mers?          stored per k-mer
MiniKraken      -                1                      4.0         357,913,941*
LMAT-ML-Min     no               1                      12.1        1,586,405,299
LMAT-ML         no               10                     15.9        1,586,405,299
LMAT-ML+H       yes              10                     16.8        1,697,066,355
LMAT-ML-All     no               All                    18.1        1,586,405,299
LMAT-ML+H-All   yes              All                    18.9        1,697,066,355

Table 2: Database sizes and k-mer counts (for LMAT) given configurations that vary the presence of human reference k-mers and the maximum count of taxonomy ids listed per k-mer. * For comparison, we include an estimate of MiniKraken's k-mer count based on 12 bytes per k-mer as provided by the authors.[9]

[Figure 1 plot: boxplots of processing rate, in bp/(s*cpu), for Metaphlan2, LMAT-ML-ramfs, LMAT-ML-Min, LMAT-ML-Human-All, LMAT-ML-Human, LMAT-ML-All, LMAT-ML, Kraken, GOTTCHA, and Clinical Pathoscope.]

Figure 1: Processing rate for LMAT using 5 database configurations and four additional processing methods, reported per CPU. 5 of the 6 LMAT-ML runs use a SATA II SSD for storage. For LMAT-ML-ramfs, the LMAT database is stored in main memory on a "ramdisk" (linux "ramfs") file system. In this and the following figures, MiniKraken is labeled simply as 'Kraken'.

Figure 1 shows processing rates for LMAT compared with the four classification tools. Each boxplot bar encompasses the results from 8 HMP sample runs. For the LMAT runs we configured five marker libraries and queried them using 16 GB of DRAM and a SATA II OCZ 500 GB solid-state drive. To compare performance when storing the database completely in DRAM memory, we show the performance of the LMAT-ML database on a compute platform with 24 GB of DRAM, where the database index can reside in main memory without paging from the external storage device using a ramdisk (linux ramfs), as indicated by the label LMAT-ML-ramfs. From these results we make several observations.
(1) MiniKraken is approximately 5 times faster than our baseline LMAT (at most 10 taxonomy identifiers per k-mer). Its database is considerably smaller, at 4 GB, than the LMAT databases (Table 2); thus, it should have greater CPU cache hit rates on average, which can have a profound impact on performance.
(2) LMAT-ML-Min (single taxonomy identifier per k-mer) is faster than our ramdisk run (LMAT-ML-ramfs). This identifies the increase in cost of processing more taxonomy identifiers.
(3) LMAT-ML (SSD configuration) is within 4% of LMAT-ML-ramfs. While we estimate that as much as 15% of the database would not be available in buffer cache at any moment, this result indicates that this ratio of database size to total DRAM does not create a substantial burden on the system's caching mechanism.
(4) The LMAT-ML+H and both LMAT-ML-All databases show slower performance. In all cases, the larger database size means less of it can fit in buffer cache, thus requiring more NVRAM access operations. The addition of human k-mers appears to have a greater impact on performance than the increase in taxonomy identifiers per k-mer, indicating that the number of distinct k-mers stored in the index necessitates more NVRAM access operations.
(5) Metaphlan2 is faster than all LMAT-ML instances, but LMAT-ML-Min is within 20%.
(6) All LMAT configurations outperform GOTTCHA and Clinical Pathoscope in terms of speed, although Clinical Pathoscope is close to the LMAT configurations that contain human reference information.
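
The size comparisons quoted in this subsection follow directly from the Table 2 values. The short sketch below recomputes them; it assumes 1 GB = 10^9 bytes for the ratio, and the dictionary and function names are introduced here for illustration only.

```python
# Values copied from Table 2: (database size in GB, number of k-mers).
TABLE2 = {
    "MiniKraken": (4.0, 357_913_941),
    "LMAT-ML-Min": (12.1, 1_586_405_299),
    "LMAT-ML": (15.9, 1_586_405_299),
    "LMAT-ML+H": (16.8, 1_697_066_355),
    "LMAT-ML-All": (18.1, 1_586_405_299),
    "LMAT-ML+H-All": (18.9, 1_697_066_355),
}


def bytes_per_kmer(size_gb, kmers):
    """Average storage per k-mer, assuming 1 GB = 1e9 bytes."""
    return size_gb * 1e9 / kmers


mk_size, mk_kmers = TABLE2["MiniKraken"]
for name, (size_gb, kmers) in TABLE2.items():
    print(
        f"{name:14s} bytes/k-mer={bytes_per_kmer(size_gb, kmers):5.1f} "
        f"k-mers vs MiniKraken={kmers / mk_kmers:4.1f}x "
        f"size vs MiniKraken={size_gb / mk_size:4.1f}x"
    )
```

Under these assumptions LMAT-ML-Min works out to about 7.6 bytes per k-mer and MiniKraken to about 11.2, and the LMAT-ML databases hold roughly 4.4 to 4.7 times as many k-mers in roughly 3 to 4.7 times the space.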

Comparing the organisms detected by each method

SIANN found none of the organisms in its database for any of the HMP samples. We manually searched for several pathogens that were included in its database and which were detected by both Metaphlan2 and LMAT-Grand but not SIANN; these included Clostridium symbiosum, Burkholderia cenocepacia, Staphylococcus epidermidis, S. caprae, S. hominis, S. lugdunensis, and S. aureus. We ran SIANN on several samples with known spiked pathogen concentrations [18] to verify that we were running the software correctly. In a sample with 10,000 genome equivalents (GE) of Bacillus anthracis, Burkholderia pseudomallei, Francisella tularensis, and Yersinia pestis and an unknown concentration of Brucella spiked into a background of human DNA and sequenced on Ion Torrent, SIANN detected the 5 spiked pathogen species with high confidence but was unable to make a correct strain call. It also made 2 false positive species calls for organisms that were not present (Francisella novicida and Francisella cf). At 100 GE spike-in, only an incorrect strain of F. tularensis was detected with low confidence, and none of the other pathogens present were detected. LMAT Grand and LMAT-ML+H detected all 5 pathogens at 10,000 and 100 GE. LMAT called the vast majority of reads at the species level (except for Brucella genus), indicating these regions are conserved across multiple genomes. LMAT with 10,000 GE correctly called Y. pestis Harbin as the top strain, and in the other cases the top strain was among the top 3 genomes with the most genome-specific reads. We did not investigate SIANN further, as it was not designed to be a general purpose metagenomics analysis tool.

An important component of identifying the organisms present in a metagenome is measuring the number of reads left unclassified. Even as microbial diversity in genomic databases continues to grow, the potential for novel genes or organisms to remain 'hidden' in the sample remains high. Thus, the ability to reduce the number of reads that must be considered for assembly or other more in-depth analysis through time-consuming sensitive protein searches with BLAST or profile Hidden Markov Models must be considered when evaluating the completeness of a method's taxonomic profiling. LMAT-Grand classified on average 83% of the reads per sample, followed by LMAT-ML+H (63%), MiniKraken (35%), Clinical Pathoscope (29%), GOTTCHA (14%), and Metaphlan2 (5%) (Figure 2A). LMAT-Grand detected an average of 178 species/sample, LMAT-ML+H 154 species/sample, MiniKraken 108 species/sample, Clinical Pathoscope 67 species/sample, Metaphlan2 42 species/sample, and GOTTCHA 27 species/sample (Figure 2B).

Figure 2A: Boxplots showing the fraction of reads that each method classified from the 131 HMP samples.

Figure 2B: Boxplots of the number of species that each method classified from each of the 131 HMP samples. LMAT Grand detected on average 11% more species than LMAT-ML+H and 57% more species than MiniKraken, the next highest method. BLAST results suggest a large number of MiniKraken calls are false positives.

Taxonomy calls made using LMAT-Grand do not represent a complete ground truth of organisms present in a sample. Since the database draws on the most complete collection of sequenced genomes among the databases considered, it is useful to measure the concordance of the different methods relative to LMAT-Grand. The overlap between each method and LMAT-Grand differed substantially, as illustrated in Figure 3, which sums species calls in agreement or in disagreement with those of LMAT Grand across the 131 samples. MiniKraken and Clinical Pathoscope showed relatively little overlap with LMAT-Grand species calls, GOTTCHA and Metaphlan2 overlapped by the majority of their species calls but only covered a small fraction of the LMAT Grand calls, and LMAT-ML+H and LMAT Grand agreed almost entirely.

Figure 3: Venn diagrams illustrating the total number and overlap of species calls for the 131 HMP samples by each method with LMAT Grand, in order of increasing overlap.

The calls by GOTTCHA, MiniKraken, Metaphlan2, and Clinical Pathoscope share higher similarity with those of LMAT-Grand at the genus level than at the species level, although neither species nor genus agreement is very high (Figure 4). To consider the possibility that the LMAT-Grand output was error prone, the discordant calls were examined in detail using BLAST alignments against a comprehensive microbial database.
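
The two overlap quantities used in this comparison, the fraction of LMAT-Grand calls that a method also makes and the fraction of a method's calls that LMAT-Grand does not make, can be written as simple set operations. The following is a minimal sketch with made-up species labels; the function name and the example sets are hypothetical, and it is not the analysis code used for the figures.

```python
def concordance(method_species, grand_species):
    """Return (recovered, unsupported) for one sample.

    recovered   = fraction of LMAT-Grand species also called by the method
    unsupported = fraction of the method's species not called by LMAT-Grand
    """
    method_species = set(method_species)
    grand_species = set(grand_species)
    shared = method_species & grand_species
    recovered = len(shared) / len(grand_species) if grand_species else 0.0
    unsupported = (
        len(method_species - grand_species) / len(method_species)
        if method_species
        else 0.0
    )
    return recovered, unsupported


# Hypothetical species labels for a single sample.
grand = {"A", "B", "C", "D"}
method = {"A", "B", "X"}
recovered, unsupported = concordance(method, grand)
print(f"recovered={recovered:.2f} unsupported={unsupported:.2f}")
```

Averaging these two values over samples gives one point per method; if the LMAT-Grand calls are taken as correct, the first value behaves like a sensitivity and the second like a false discovery rate.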

Figure 4: The fraction of species (circles) or genera (crosses) detected by LMAT Grand that were also detected by a given method versus the fraction of species detected by a given method that were not detected by LMAT Grand, averaged across the 131 HMP samples. If the calls by LMAT Grand are correct as indicated by BLAST results (see below), then this is akin to sensitivity versus false discovery rate.

Calls made by other methods that were not supported by LMAT Grand

All the marker library methods except LMAT-ML detected a number of species in each sample that LMAT-Grand did not (Figures 3 and 4). Reads were extracted from a given method's mapping results for the 10 most commonly occurring species calls that differed from LMAT Grand. Figure 5 shows the sum of these extracted reads across samples and species, totaling from 10,000 to more than 1 M reads for non-LMAT methods. With orders of magnitude fewer reads for LMAT-ML+H potential false positives, manual inspection revealed that the difference between LMAT-ML+H and LMAT-Grand lay in minor quantitative differences close to the threshold read count for calling a species as present rather than qualitatively different calls. BLAST searches with these reads against the comprehensive microbial genome database provided evidence in support or contradiction of the potential false positives, summarized in Figure 6 and described in detail for each method.

[Figure 5 plot: number of reads (log scale, 1,000 to 10,000,000) for the top 10 most commonly called species not detected by LMAT-Grand, for Kraken, Clinical Pathoscope, GOTTCHA, Metaphlan2, and LMAT-ML+H.]

Figure 5: Number of candidate false positive reads totaled across samples for the top 10 most common species calls that were not detected by LMAT Grand.

Figure 6: Fraction of best BLAST matches to species called by the indicated method that were discordant with species calls by LMAT-Grand. These were for the most commonly called species by the indicated method that were not detected by LMAT-Grand in those same samples. In all non-LMAT methods, only a minority of BLAST hits to a comprehensive sequence database supported the species called by that method, suggesting that these were incorrect species calls. Calls by LMAT-ML+H had a majority of matches to the correct species, on average, but nearly as many matches to other species in the genus, supporting genus level calls for some of these reads. BLAST results provided strong support for LMAT-Grand calls that were not detected by other methods. A more detailed analysis of the discordant calls for each method is given in the following sections.

MiniKraken unsupported calls

MiniKraken made a number of unsupported calls that were repeatedly seen across half or more of the samples. For the most common MiniKraken species calls that were not consistent with LMAT species calls, 7 of the 10 had 5% or fewer BLAST hits to the species identified by MiniKraken. The other 3 still had only a minority, 19-38%, of BLAST matches to the MiniKraken species, and the most common LMAT classification for these reads was at the genus level or above. The unsupported MiniKraken species call with the bulk of reads, over 2 million, had 99.5% of reads matching the database of synthetic sequences. These BLAST results indicate that MiniKraken calls were incorrect or overly specific, and failed to correct for contamination of reference sequences with vector and other artificial sequence contaminants (Supplementary Table S2).
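
The support percentages reported in this and the following sections rest on two steps: keeping, for each read, only the BLAST matches with the lowest e-value (ties allowed), and then asking what fraction of those best matches are to the species the method called. A rough sketch of that bookkeeping is given below. The tuple layout, function names, and toy hits are assumptions for illustration; parsing of real BLAST output is not shown.

```python
from collections import defaultdict


def best_hits(hits):
    """Keep, per read, every hit that shares that read's lowest e-value.

    hits: iterable of (read_id, subject_species, evalue) tuples.
    Returns a list of (read_id, subject_species) pairs.
    """
    by_read = defaultdict(list)
    for read_id, species, evalue in hits:
        by_read[read_id].append((species, evalue))
    kept = []
    for read_id, read_hits in by_read.items():
        lowest = min(evalue for _, evalue in read_hits)
        kept.extend((read_id, s) for s, e in read_hits if e == lowest)
    return kept


def support_fraction(hits, called_species):
    """Fraction of best hits that match the species a method called."""
    kept = best_hits(hits)
    if not kept:
        return 0.0
    return sum(1 for _, s in kept if s == called_species) / len(kept)


# Toy hits for three reads that some method assigned to species "A".
toy_hits = [
    ("r1", "A", 1e-30), ("r1", "B", 1e-30), ("r1", "C", 1e-10),
    ("r2", "B", 1e-20),
    ("r3", "A", 1e-25), ("r3", "B", 1e-5),
]
print(support_fraction(toy_hits, "A"))
```

In the toy data, read r1 ties between species A and B, so both hits are kept; a read that matches several species equally well lowers the support fraction for any single species call, which is the pattern described here as an overly specific call.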
Clinical Pathoscope unsupported calls

Clinical Pathoscope commonly called a number of bacteria in nearly half the samples that were not supported by LMAT-Grand calls. Of reads mapped to organisms called by Clinical Pathoscope but not LMAT-Grand, most had fewer than 1% of BLAST hits to the called organism, suggesting that the organisms with the highest similarity to these reads were not in the Clinical Pathoscope database, so the reads were misattributed to other species with homology (Supplementary Table S3). There were two exceptions for which more than 10% of the lowest e-value BLAST matches were to the organism called by Clinical Pathoscope: Propionibacterium acnes with 28%, and Bacteroides vulgatus with 17%, although these are still a minority of BLAST matches. LMAT-Grand classified most of those P. acnes reads as genus Propionibacterium, order Actinomycetales, class Actinobacteria, and cellular organisms. LMAT-Grand called the majority of those B. vulgatus reads at the Bacteroides genus level, in addition to some thousands of reads to other genera, order Bacteroidales, phylum Bacteroidetes, and superkingdom Bacteria. The fact that only a minority of BLAST hits are to the organism identified by Clinical Pathoscope suggests that calls by Clinical Pathoscope may be overly specific, with equivalent similarity to many species. In addition, up to 19% of the reads for some of these unsupported Clinical Pathoscope calls had BLAST matches to artificial sequences, indicating that this method fails to adequately control for contamination of the reference database by artificial sequences such as vectors and adaptors.

GOTTCHA unsupported calls

Phage dominated the calls unique to GOTTCHA. All the reads that GOTTCHA labeled as Enterobacteria phage lambda matched our vector and other synthetic sequences database, and LMAT labeled almost all these reads as synthetic constructs or root (Supplementary Table S4). Of the reads that GOTTCHA called as Enterobacteria phage phiX174 sensu lato, 23% had BLAST matches to our vector database, and LMAT labeled them as root, superkingdom Bacteria, synthetic sequences, or cellular organisms. Elements of these phage are used for genetic engineering and as controls in certain Illumina sequencing protocols, so we include them in the vector/synthetic database, and this result is not surprising. While the other calls did not have many BLAST matches to artificial sequences, only a minority of the top BLAST hits matched the GOTTCHA call, since these reads were widely conserved across species or higher ranks, and LMAT identified these reads largely as root or genus level calls.

Metaphlan2 unsupported calls

There were 4 organisms commonly called by Metaphlan2 that were unsupported by LMAT-Grand calls: Dasheen mosaic virus, Streptococcus sp. GMD4S, Catonella morbi, and Abiotrophia defectiva, plus some other unsupported calls that occurred in just a few samples (Supplementary Table S5). In contrast to Clinical Pathoscope, Metaphlan2 unique calls had very few matches to artificial sequence. Reads that Metaphlan2 uniquely labeled as Dasheen mosaic virus and Vicia cryptic virus had no best BLAST matches to these viruses, prompting us to double check that these viruses were indeed present in our BLAST database. LMAT calls for these reads were predominantly to Homo sapiens, plus small numbers to high level classifications such as taxonomy root node, cellular organisms, and various bacterial genera. For some of the organisms, 48%-68% of the BLAST matches were to the correct organism, but there were as many matches of the same quality to other organisms, suggesting overly specific Metaphlan2 calls to reads conserved at the genus or phylum level. Supporting this observation, LMAT calls for those particular reads were overwhelmingly at the phylum and genus levels, with some even at the kingdom Bacteria and cellular organisms levels. For other Metaphlan2 species calls, LMAT-Grand classified more of those reads to several near neighbor species in the same genus, with only a small subset of those reads classified as the same species as Metaphlan2, not enough to reach the minimum call threshold of 100 reads in those samples.
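The minimum call threshold of 100 reads mentioned above can be pictured as a simple count filter over per-read species assignments. The sketch below is an assumption-laden illustration: the input format (one species label per classified read), the function name, and the inclusive boundary (exactly 100 reads counts as a call) are not taken from the source.

```python
from collections import Counter

MIN_READS = 100  # call threshold quoted in the text


def species_calls(read_assignments, min_reads=MIN_READS):
    """Return species whose read count reaches the call threshold.

    read_assignments: iterable of species labels, one per classified read
    (hypothetical input format). The boundary is assumed to be inclusive.
    """
    counts = Counter(read_assignments)
    return {species: n for species, n in counts.items() if n >= min_reads}


if __name__ == "__main__":
    # Toy sample: one species at the threshold, one just below it.
    reads = ["Streptococcus_A"] * 100 + ["Streptococcus_B"] * 99
    print(species_calls(reads))
```

A species with 99 reads is present in the per-read output but absent from the call list, which is the kind of near-threshold difference described for the subset of reads that agreed with Metaphlan2.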
LMAT-ML+H unsupported calls

Unsupported calls were less common and less consistent across multiple samples for LMAT-ML+H than for the other methods, and involved fewer reads (Figure 5). When we investigated LMAT-ML+H differences from the Grand database, most of the differences were in samples where the number of reads called hovered around the 100 read threshold, so that species called by LMAT-ML+H were also observed in the Grand results but at abundances just below 100 reads (Table 3). LMAT-Grand calls for the extracted species reads from the LMAT-ML+H runs always included some calls to the species detected by LMAT-ML+H for those reads, as well as a larger number at a higher taxonomy level. Manual inspection of many examples showed that both LMAT-ML+H and LMAT-Grand always called multiple species in the genus, although the distribution of read counts among those species differed. The average score per species was usually lower for LMAT-ML+H than for Grand. LMAT calculates a log odds read score from the number of k-mer matches in a read relative to a null model simulated with random sequences for each database, adjusting the GC content and read length of the null model to match those of the read.[14] The LMAT-ML+H database has many fewer k-mers than the Grand database, resulting in lower average scores. Low scores indicate that an organism is most similar to that taxonomic call but does not perfectly match the reference sequence, suggesting a novel variant. The calls to Tetrahymena thermophila had average read scores of just above 0.2 for both LMAT-ML+H and Grand, and so were below the threshold for Grand but not LMAT-ML+H. These reads had BLAST matches only to Plasmodium yoelii and Tetrahymena thermophila, and were very repetitive and AT rich. Classification of such low complexity reads is a challenge, especially considering the difficulty of assembling reference genomes for protozoa around such regions, with the potential for misassembly and contamination with host and other eukaryotic DNA. In summary, differences between LMAT-Grand and LMAT-ML+H were more likely for organisms with low numbers of reads or low scores, i.e. those with low abundance or low similarity to that genome in the database.

                                  Number of     Fraction BLAST  Fraction reads
                                  samples with  matches to      with matches to
                                  unsupported   called          synthetic
Species                           call          species         constructs       Rank, taxon
Streptococcus_sp._M143            21            0.506           0                genus, Streptococcus
Streptococcus_sp._GMD4S           20            0.164           0                genus, Streptococcus
Streptococcus_sp._oral_taxon_071  15            0.279           0                genus, Streptococcus
Streptococcus_sp._AS14            13            0.765           0                genus, Streptococcus
Lachnospiraceae_bacterium_MSX33   12            0.607           0                order, Clostridiales
Streptococcus_sp._GMD6S           12            0.333           0                genus, Streptococcus
Bacteroides_vulgatus              11            0.852           0                genus, Bacteroides
Tetrahymena_thermophila           11            0.5*            0                no rank, cellular organisms
Streptococcus_pseudopneumoniae    11            0.658           0                genus, Bacteroides
Streptococcus_mitis               10            0.852           0                genus, Streptococcus

Table 3: BLAST analysis of unsupported calls by LMAT-ML.
* BLASTed against a database of protozoa sequences as well as bacterial and viral sequences, using the -dust no and -word_size 20 options to BLAST in addition to the options described in Methods.

Calls made by LMAT-Grand that were not detected by other methods

LMAT-Grand calls had strong BLAST support: the LMAT-identified species dominated the BLAST matches. This supports the notion that these are false negatives by GOTTCHA, MiniKraken, Clinical Pathoscope, and Metaphlan2, all of which use substantially smaller reference databases than LMAT-Grand (Table 4). These were not isolated false negatives, as most occurred in over half the HMP samples we compared.
In contrast, for LMAT-ML+H, even the most common false negatives occurred in less than a third of the samples. For the LMAT-ML+H comparisons, in every case we examined, the species was actually detected by LMAT-ML+H but fell under the 100 read count threshold, so it was a matter of slight quantitative differences near the threshold rather than qualitative differences, the same result as discussed above for the LMAT-ML+H unique calls. This suggests that when using LMAT-ML+H, adjustments should be made to account for its lower sensitivity relative to LMAT-Grand. LMAT-ML+H uses the identical database of reference genomes as LMAT-Grand, down selecting to a fraction of the most taxonomically informative k-mers and removing redundancy by eliminating many of the overlapping k-mers, so it is not surprising that differences between the two databases are small.

                                       Number of
                                       samples with                Number
                                       call by LMAT                reads,       Fraction BLAST
                                       Grand but                   summed       matches to
                                       not other                   across       LMAT Grand
Species                                method        Method*       samples      called species
Veillonella_dispar                     83            CP,G,K        127783       0.949
Streptococcus_infantis                 81            CP,G,K        777190       0.956
Granulicatella_adiacens                81            CP,G,K        37017        0.927
Gemella_haemolysans                    80            CP,G,K        1099653      0.995
Porphyromonas_sp._oral_taxon_279       78            CP,G,K        13566        0.931
Prevotella_oris                        77            CP,G,K        605452       0.998
Prevotella_salivae                     77            CP,G,K        4050496      0.999
Leptotrichia_wadei                     77            CP,G,K        3447770      0.680
Fusobacterium_periodonticum            76            CP,G,K        11218643     0.992
Lachnoanaerobaculum_saburreum          76            CP,G,K        823744       0.833
Streptococcus_oralis                   78            M             409051       0.935
Tannerella_sp._oral_taxon_BU063        73            M             3267593      1.000
Streptococcus_pneumoniae               73            M             410193       0.957
Streptococcus_agalactiae               70            M             97079        0.986
Prevotella_sp._ICM33                   70            M             1433785      0.940
Veillonella_sp._oral_taxon_158         68            M             8530         0.878
Neisseria_mucosa                       68            M             277668       0.759
Prevotella_sp._F0091                   68            M             411478       0.919
Actinomyces_sp._oral_taxon_175         62            M             14421        0.765
Streptococcus_sp._M334                 62            M             456562       0.876
Streptococcus_suis                     37            LMAT-ML+H     1820         0.917
Candidatus_Saccharimonas_aalborgensis  31            LMAT-ML+H     8636         0.273
Prevotella_sp._HJM029                  26            LMAT-ML+H     4381         0.917
Staphylococcus_sp._DORA_6_22           26            LMAT-ML+H     8839         0.921
Streptococcus_sp._I-P16                26            LMAT-ML+H     7223         0.896
Lactococcus_lactis                     25            LMAT-ML+H     690          0.991
Aggregatibacter_actinomycetemcomitans  23            LMAT-ML+H     1095         0.883
Actinomyces_urogenitalis               22            LMAT-ML+H     448          0.616
Streptococcus_sp._oral_taxon_056       21            LMAT-ML+H     122          0.656
Streptococcus_sp._HSISS2               20            LMAT-ML+H     4621         0.929

Table 4: BLAST analysis of calls made by LMAT-Grand but not other methods.
* CP = Clinical Pathoscope, G = GOTTCHA, K = Kraken, M = Metaphlan2

Detecting Eukaryotes

LMAT (all database variants) is the only metagenomic analysis tool that includes eukaryotic sequences in addition to human in the reference database. As a result, we were able to classify reads matching fungi, protozoa, plants, and animals (Table S1). The majority of eukaryotic reads were human, followed by the fungal subkingdom Dikarya, in which many genera and species were detected. Reads were detected from dog (Canis, Canis lupus, or Canis lupus familiaris; 3 samples with 49-578 reads each).
The fungal genus Malassezia had particularly large numbers of reads in many samples, an expected result for a genus naturally found on the skin of many animals. Small numbers of reads for various pathogenic protozoa were detected, including the apicomplexans Plasmodium, Eimeria (coccidiosis in livestock), Hammondia hammondi, and Toxoplasma gondii, as well as the amoeba Acanthamoeba castellanii. Five samples contained unusually high numbers of reads (~20,000) of Hammondia hammondi, which relies on cats as its definitive host. One stool sample had ~2,700 reads of Blastocystis hominis, a gastrointestinal parasite of disputed pathogenicity. Several samples contained 1000-3000 reads of Entamoeba nuttalli, known to cause illness in non-human primates. The freshwater ciliated protozoan Tetrahymena thermophila was detected in a number of samples, most often with low read counts, except for a few cases with thousands of reads. One retroauricular crease sample had ~9,500 reads classified as Trypanosoma cruzi (the agent of Chagas disease), which can persist unnoticed in the host for decades, although detection on skin may not mean the host is infected. Small numbers of reads of Trichomonas vaginalis were detected in a handful of samples. While the bulk of calls were bacterial, these observations suggest that further study confirming the presence of eukaryote reads in the HMP data is warranted. Many draft eukaryotic sequences appear to contain misassembled human and vector/synthetic sequence, and we have invested substantial effort to control and correct for this contamination [14]. We conducted spot checks with BLAST on reads assigned to eukaryotes (including Malassezia and Canis lupus) to confirm that obvious misassignment errors were not present. Nevertheless, non-human eukaryote-classified reads deserve an extra measure of caution, such as demanding higher read count or score thresholds for calling a species as present. LMAT-ML+H is less sensitive than LMAT-Grand for these organisms, as manual comparisons of several of these species showed fewer reads and lower scores with the ML database, but it is still capable of detecting these organisms using 16-24 GB of memory.
Discussion

Sensitivity was too low for SIANN to be feasible for clinical samples, and false positives were seen in samples with pathogen at high spiked concentrations. MiniKraken processed metagenome reads the fastest, but it was also the least accurate on the real world HMP samples: it had poor BLAST support for the species it commonly detected, missed a large number of species that did have strong BLAST support, and classified only about a third of the reads on average. Clinical Pathoscope and GOTTCHA also had poor accuracy, were the slowest classifiers in our tests, and failed to classify even larger percentages of reads. Metaphlan2 was faster than LMAT-ML, although it classified an average of only 5% of reads and missed many species that were clearly present based on BLAST results. LMAT-ML ran on average at about half the speed of Metaphlan2 and required 24 GB of DRAM to avoid any performance penalty for paging the database index, more than other marker library methods. However, LMAT-ML+H classified over 60% of reads and showed far better accuracy than other marker library methods as verified by BLAST, delivering results nearly as complete as those of LMAT-Grand with only a fraction of the memory requirements, at a speed capable of analyzing a gigabase-sized sample in about 3.5 minutes with 24 CPUs and 24 GB of DRAM. Running LMAT-ML+H with 16 GB instead of 24 GB of DRAM is roughly 50% slower.

The processing rate of LMAT-ML databases is affected by two key factors: the rate at which the index can be accessed, and the classification rate of an individual read, which depends on the number of taxonomy identifiers retrieved for each constituent k-mer in the read. Access to the index depends on the size of the database, with larger databases requiring additional paging.
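To make the second factor concrete, the sketch below counts how many index lookups succeed and how many taxonomy identifiers are retrieved while scanning the k-mers of one read. It is a toy illustration only: the dictionary index, the k value of 4, and the function names are assumptions for the example and do not reflect LMAT's actual index layout or k-mer size.

```python
def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def taxid_lookups(read, index, k):
    """Count k-mers found in the index and taxonomy IDs retrieved.

    index: dict mapping k-mer -> set of taxonomy IDs (hypothetical layout).
    Returns (kmers_found, taxids_retrieved).
    """
    found = 0
    retrieved = 0
    for kmer in kmers(read, k):
        taxids = index.get(kmer)
        if taxids:
            found += 1
            retrieved += len(taxids)
    return found, retrieved


if __name__ == "__main__":
    # A conserved k-mer carries more taxonomy IDs than a specific one.
    toy_index = {"ACGT": {1, 2}, "CGTA": {2}}
    print(taxid_lookups("ACGTA", toy_index, k=4))
```

In this picture, reads dominated by widely conserved k-mers retrieve more identifiers per lookup, which is one way the per-read classification rate can vary between databases.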
We consider the use of a SATA II SSD because it provides a relatively low cost alternative to DRAM for storage. Current advertised prices are about $11 per gigabyte for DRAM, $0.75 per gigabyte for SATA SSD, and around $5 per gigabyte for mid tier PCIe flash. We present the use of an SSD with LMAT and limited memory, in contrast to previous experiments with PCIe flash and larger LMAT database indexes [19]. Additional experimentation with the larger Grand database and SATA SSD with limited main memory demonstrated subpar performance. For the smaller ML databases, however, the performance reduction with SATA SSD was minor, since demands on main memory were much lower. Although LMAT runs faster in DRAM only, we estimate that a laptop or desktop with 16 GB of RAM and a 24 GB flash drive will perform rapid and accurate metagenome analyses with LMAT-ML+H.

While all methods aim to classify reads at the most specific level possible, that level of specificity must be supported by the data. All of the methods other than LMAT failed to identify genus, family, phylum, or even higher levels of conservation in the HMP reads, and thus reported overly specific calls. Databases like RefSeq have only one representative sequence for many species and some genera, and the documentation explicitly states that more than one strain will be included only in exceptional circumstances as determined manually by NCBI staff [20]. This renders suspect any strain calls made by classifiers relying on RefSeq, since there are not enough near neighbors to resolve at this level. In addition, MiniKraken, Clinical Pathoscope, and GOTTCHA made errors by classifying as microbial reads that were much more likely to be artificial sequences from sample preparation. Metaphlan2 did not misclassify artificial sequences, but it did misclassify human reads as viral. We and others [9,21] have found substantial contamination of draft genomes with adaptor, vector, and other synthetic sequences and with human or other host sequences. LMAT-Grand and LMAT-ML specifically label synthetic and human k-mers and then apply a greedy strategy to detect reads with these sequences. This allows the LMAT database to contain large numbers of draft sequences to span novel strain and species diversity without misclassifying human and synthetic reads as microbial due to contaminated draft assemblies.
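The text does not spell out the greedy rule, so the following Python sketch shows only one plausible reading of it: scan a read's k-mers in order and assign the read to "synthetic" or "human" as soon as enough labeled k-mers of that kind are seen, before any microbial classification is attempted. The label names, the `min_hits` parameter, the dictionary lookup, and the toy k value are all hypothetical.

```python
SYNTHETIC = "synthetic"
HUMAN = "human"


def screen_read(read, labeled_kmers, k, min_hits=1):
    """Return 'synthetic', 'human', or None for a read.

    labeled_kmers: dict mapping k-mer -> label (hypothetical structure).
    Greedy in the sense that scanning stops at the first label that
    accumulates min_hits matching k-mers.
    """
    hits = {SYNTHETIC: 0, HUMAN: 0}
    for i in range(len(read) - k + 1):
        label = labeled_kmers.get(read[i:i + k])
        if label in hits:
            hits[label] += 1
            if hits[label] >= min_hits:
                return label
    return None


if __name__ == "__main__":
    toy_labels = {"ACGT": SYNTHETIC, "TTTT": HUMAN}
    for read in ("GGACGTGG", "CTTTTC", "GGGGGGGG"):
        print(read, screen_read(read, toy_labels, k=4))
```

Reads flagged this way would be set aside, so that a contaminated draft assembly sharing those k-mers cannot pull them into a microbial call.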
MiniKraken, Clinical Pathoscope, and GOTTCHA reference sequences consist of NCBI RefSeq complete bacterial, archaeal, and viral genomes and the human reference genome, and Metaphlan2 and SIANN use even smaller subsets of these sequences. With LMAT-Grand and LMAT-ML, we extend the reference database to span every microbial genome in the public domain, and more, so that LMAT is the only metagenome classification software that includes 1) eukaryotic sequence in both the Grand and the LMAT-ML databases, enabling the classification of fungi, protozoa, and some multicellular organisms (from organelles labeled as whole genome, e.g. mitochondria and chloroplasts); 2) draft genomes and assembled contigs not contained in NCBI RefSeq; and 3) draft and finished bacteria, virus, archaea, fungi, and protozoa genomes from a number of sequencing centers worldwide with publicly available sequence data in addition to those from NCBI. While sequences available from these sites eventually appear in NCBI databases, they may be publicly available years before release at NCBI, and many strains may never become a part of NCBI RefSeq.

Since LMAT includes all available strains (genomes) that have been sequenced, it should provide more accurate species resolution and reporting of the closest strains in the database than other methods.
LMAT reports the number of reads matching multiple strains, and for reads conserved across multiple strains it reports only the species level match, since NCBI taxonomy nodes in general do not exist for clades. Even for isolates with sequenced genomes, there may be more than one best strain match for different subsets of reads due to evolution from the original sequenced isolate during propagation, lateral gene transfer, recombination, and sequencing error. A phylogenetic approach is probably necessary to accurately place new sequence in the broader context of other isolates, possibly using assembly, alignment, or SNPs, provided there is sufficient genome coverage. Other methods that fail to include many species and strains in their reference database cannot resolve specific strains. Results presented here show that these methods are incorrect or overly specific even in their species and genus level classifications.

Not only is LMAT the only method that can detect eukaryotic sequences, it also reports calls to plasmids versus chromosomes, which distinguishes the presence of these mobile genetic elements. Metaphlan2, GOTTCHA, and MiniKraken do not detect plasmids and make only taxonomic calls. Clinical Pathoscope does identify reads by database entry, so it is possible to distinguish plasmid from chromosomal matches. LMAT distinguishes plasmid calls, creates a file exclusively listing the plasmids detected, and also includes those plasmid calls in the overall results summary with all taxonomic calls. For methods other than LMAT, no process is described in the manuals for extracting the reads responsible for a given call, making it difficult to verify those calls and to do additional analyses such as assembly, per species gene annotation, SNP analyses, or distinguishing matches to plasmid versus chromosome. In addition, the failure of some methods to report standardized NCBI taxonomy identifiers for all calls, together with the use of nonstandard or outdated species names and GI numbers not in the current NCBI database, makes the process especially challenging. We encourage software developers to describe procedures for extracting reads for the taxonomic calls made by the method, to facilitate call verification from the reads responsible for each call.
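As an illustration of the kind of procedure being asked for, the sketch below pulls out the reads behind one taxonomic call and writes them as FASTA text for follow-up BLAST or assembly. It assumes a classifier that emits per-read (read ID, taxonomy ID) pairs and an in-memory dictionary of read sequences; the function names, input shapes, and the taxonomy IDs in the usage example are placeholders, not the output format of any tool discussed here.

```python
def reads_for_call(assignments, taxid):
    """Read IDs assigned to a given taxonomy ID.

    assignments: iterable of (read_id, taxid) pairs (hypothetical format).
    """
    return [read_id for read_id, assigned in assignments if assigned == taxid]


def extract_fasta(assignments, sequences, taxid):
    """FASTA text for all reads assigned to taxid.

    sequences: dict mapping read_id -> nucleotide string (assumed input).
    Reads without a stored sequence are skipped.
    """
    records = []
    for read_id in reads_for_call(assignments, taxid):
        if read_id in sequences:
            records.append(f">{read_id}\n{sequences[read_id]}\n")
    return "".join(records)


if __name__ == "__main__":
    # Placeholder taxonomy IDs (10 and 20) chosen only for the example.
    toy_assignments = [("r1", 10), ("r2", 20), ("r3", 10)]
    toy_sequences = {"r1": "ACGT", "r2": "GGCC", "r3": "TTAA"}
    print(extract_fasta(toy_assignments, toy_sequences, 10), end="")
```

Keying the extraction on a standardized taxonomy identifier rather than a species name avoids the naming mismatches described above.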
Alignment-based methods (e.g. BLAST and read mapping) scale linearly with the number of bases in the reference database. To scale with an ever growing pool of reference genomes, alignment-based software must reduce the reference to a subset of the available data by excluding strain variants, draft genomes, and non-microbial kingdoms. As a consequence, these methods fail to classify large numbers of reads, report overly specific classifications for sequences that are in fact more widely shared across taxa, and either misclassify or fail to detect all the species and genera missing from the database. In addition, alignments require a cap on the maximum number of alignments returned to retain reasonable run times. We have observed that for highly conserved sequences (like 16S rRNA or housekeeping genes), where there may be thousands of additional unreported matches over that maximum, the sort order of reported matches can result in biases and overly specific calls for taxa that may not actually be present. In contrast, the k-mer based approach has the advantage of retaining and condensing conserved subsequences, so that adding related reference genomes increases the database size only for novel k-mers plus the small increment of adding that genome tag to k-mers already stored. Thus, the database size grows as a function of sequence diversity, not as a strictly linear increase with the number of bases in the reference database. We plan to add more eukaryotic genomes to the LMAT database (e.g. mosquitos, nematodes, ticks, plants) to classify more reads from environmental samples, which should be tremendously helpful in fields such as bioenergy, microbial ecology, industrial metagenomics, and environmental biosurveillance.[4] Many eukaryotes have extremely large and repetitive genomes [22], so a k-mer database that scales with diversity rather than genome size and that includes
more eukaryotes should further reduce the number of unclassified reads, and improve our ability to separate reads from truly novel, unknown microbes for further analysis.

Comparing results from actual HMP samples across 6 metagenome analysis software packages, we found that the LMAT Marker Library "LMAT-ML+H" classified microbial contents most accurately and comprehensively, due to its reliance on a reference database 1-2 orders of magnitude larger than that of other software and representing 2-4 times more species. Its speed is competitive with other tools, and although memory demands are higher, they are still well within the price range of a standard desktop machine with 24 GB of memory, or 16 GB of memory with a low cost SSD drive.

Acknowledgements

We thank Marisa Torres and Clinton Torres for building the infrastructure to download and update the reference sequence database used by LMAT. This work was performed under the auspices of the US Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Laboratory Directed Research and Development (33-ER-2012 and 08-ER-2011).

Conflict of Interest: none declared.

References

1. Miller R, Montoya V, Gardy J, Patrick D, Tang P: Metagenomics for pathogen detection in public health. Genome Medicine 2013, 5(9):81.
2. Padmanabhan R, Mishra AK, Raoult D, Fournier P-E: Genomics and metagenomics in medical microbiology. Journal of Microbiological Methods 2013, 95(3):415-424.
3. Di Bella JM, Bao Y, Gloor GB, Burton JP, Reid G: High throughput sequencing methods and analysis for microbiome research. Journal of Microbiological Methods 2013, 95(3):401-414.
4. Neelakanta G, Sultana H: The Use of Metagenomic Approaches to Analyze Changes in Microbial Communities. Microbiology Insights 2013, 6:37-48.
5. Naccache SN, Federman S, Veeraraghavan N, Zaharia M, Lee D, Samayoa E, Bouquet J, Greninger AL, Luk K-C, Enge B et al: A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples. Genome Research 2014, 24:1180-1192.
6. Desai N, Antonopoulos D, Gilbert JA, Glass EM, Meyer F: From genomics to metagenomics. Current Opinion in Biotechnology 2012, 23(1):72-76.
7. Wilkening J, Wilke A, Desai N, Meyer F: Using Clouds for Metagenomics: A Case Study. 2009 IEEE International Conference on Cluster Computing and Workshops 2009:80-85.
8. Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C: Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Meth 2012, 9(8):811-814.
9. Wood D, Salzberg S: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 2014, 15(3):R46.
10. Byrd A, Perez-Rogers J, Manimaran S, Castro-Nallar E, Toma I, McCaffrey T, Siegel M, Benson G, Crandall K, Johnson W: Clinical PathoScope: rapid alignment and filtration for accurate pathogen identification in clinical samples using unassembled sequencing data. BMC Bioinformatics 2014, 15(1):262.
11. Freitas TAK, Li P-E, Scholz MB, Chain PSG: Accurate read-based metagenome characterization using a hierarchical suite of unique signatures. Nucleic Acids Research 2015, 43(10):e69.
12. Minot S, Turner SD, Ternus KL, Kadavy DR: SIANN: Strain Identification by Alignment to Near Neighbors; 2014.
13. Ames SK, Gardner SN, Slezak TR, Gokhale MB, Allen JE: Using populations of human and microbial genomes for organism detection in metagenomes. Genome Research 2015, 25:1056-1067.
14. Ames SK, Hysom DA, Gardner SN, Lloyd GS, Gokhale MB, Allen JE: Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics 2013, 29(18):2253-2260.
15. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology 1990, 215(3):403-410.
16. Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase Update, a database of eukaryotic repetitive elements. Cytogenetic and Genome Research 2005, 110(1-4):462-467.
17. Allen JE, Gardner SN, Slezak TR: DNA signatures for detecting genetic engineering in bacteria. Genome Biology 2008, 9(3):R56.
18. Frey KG, Herrera-Galeano JE, Redden CL, Thissen JL, Dyer M, Allred A, Mokashi V, Gardner SN, Slezak T: Towards Quantitative Metagenomics: Targeted Enrichment for Detection of BioThreat Agents. In: Ion World: October 21-22 2013; Boston, MA.
19. Ames S, Allen JE, Hysom DA, Lloyd GS, Gokhale MB: Design and Optimization of a Metagenomics Analysis Workflow for NVRAM. In: 13th IEEE International Workshop on High Performance Computational Biology. May 2014.
20. Pruitt K, Brown G, Tatusova T, Maglott D: The Reference Sequence (RefSeq) Database. In: The NCBI Handbook [Internet]. Edited by McEntyre J, Ostell J: National Center for Biotechnology Information (US); 2002.
21. Merchant S, Wood DE, Salzberg SL: Unexpected cross-species contamination in genome sequencing projects. PeerJ 2014, 2:e675.
22. Nene V: Tick genomics--coming of age. Frontiers in Bioscience (Landmark edition) 2009, 14:2666-2673.