bioRxiv preprint first posted online Jan. 14, 2016; doi: http://dx.doi.org/10.1101/036681. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
Searching more genomic sequence with less memory for fast and accurate metagenomic profiling
Shea N. Gardner1,3, Sasha K. Ames2,3, Maya B. Gokhale2,3, Tom R. Slezak1,3, Jonathan E. Allen1,3+
1 Global Security Computer Applications Division
2 Center for Applied Scientific Computing
3 Lawrence Livermore National Laboratory, Livermore CA
+ Address for correspondence: [email protected]
Abstract
Software for rapid, accurate, and comprehensive microbial profiling of metagenomic sequence data on a
desktop will play an important role in large scale clinical use of metagenomic data. Here we describe
LMAT-ML (Livermore Metagenomics Analysis Toolkit-Marker Library), which can be run with 24 GB of
DRAM memory, an amount available on many clusters, or with 16 GB DRAM plus a 24 GB low cost
commodity flash drive (NVRAM), a cost effective alternative for desktop or laptop users. We compared
results from LMAT with five other rapid, low-memory tools for metagenome analysis for 131 Human
Microbiome Project samples, and assessed discordant calls with BLAST. All the tools except LMAT-ML
reported overly specific or incorrect species and strain resolution of reads that were in fact much more
widely conserved across species, genera, and even families. Several of the tools misclassified reads from
synthetic or vector sequence as microbial, or human reads as viral. We attribute the high numbers of
false positive and false negative calls to a limited reference database with inadequate representation of
known diversity. Our comparisons with real world samples show that LMAT-ML is the only tool tested
that classifies the majority of reads, and does so with high accuracy.
Introduction
Recent studies show that the microbiome plays an important role in the health of humans, animals, and
natural and agricultural systems.[1-4] Metagenomic sequencing of human microbiomes has already
contributed to diagnosing and treating sick patients[5], and is poised to play a much larger role,
provided that the technique can deliver accurate and timely analysis of multi-gigabases of unassembled
reads. Metagenomic analysis typically demands substantial computing resources, either in terms of CPU
or memory, or both, and run times can exceed the time for sequencing.[6] As institutions invest in
sequencing infrastructure, they may not have a parallel capability to invest in and maintain large compute
clusters, and issues of patient privacy or data transfer bottlenecks may discourage cloud or centralized
analysis. For large datasets, running BLAST analysis on Amazon’s EC2 cloud was several times more
expensive than the sequencing itself, and costs of sequencing are declining faster than those of
computing.[7] Rapid, sensitive, and accurate methods of taxonomic classification of the sample
contents that can run on relatively low-priced desktop machines promise a solution.
To achieve the goal of fast, accurate metagenomic analysis, various metagenome analysis software
packages reduce the original sequence database to a smaller, more easily searchable marker library
containing a taxonomically informative subset. Metaphlan2 [8] matches reads to a small set of marker
genes, single-copy genes present in many bacteria, or clade-specific genes to do taxonomic classification
and abundance estimation without attempting to classify all reads. Kraken [9] matches the k-mers
(k=31) in a read to those in a reference database. It pre-computes the lowest common ancestor (LCA) of
reference sequences containing each k-mer, and applies a tree traversal algorithm to taxonomically
label a read from its k-mers. The full Kraken database, however, does not fit in memory on common
desktop computing systems. The MiniKraken database was created as a less memory-intensive
alternative for running on desktop systems. Clinical Pathoscope [10] aims to identify pathogens in
clinical samples by alignment to NCBI bacterial and viral reference genomes and Bayesian statistical
confidence estimation. GOTTCHA [11] maps reads to a pre-computed database of unique subsequences
at multiple taxonomic ranks (family, genus, species, strain, etc.). SIANN [12] also maps reads to a
pre-calculated database of species- and strain-specific regions of pathogens and their near neighbors. Unlike
the other methods mentioned here, SIANN was designed for the specific task of rapidly assessing whether
any member of a defined set of pathogens is present in a metagenomics sample. It is not a
general-purpose metagenomics tool. SURPI is another recent pathogen detection system, which uses an
approach similar to Clinical Pathoscope by mapping reads to reference genomes for organism
identification, but it requires more memory (60 GB) than is available on a typical desktop.[5]
LMAT uses a reference genome database that contains both draft and finished genomes from bacteria,
archaea, viruses, and some eukaryotes including pathogenic protozoa. This set of reference genomes is
more than 11-fold larger than any other metagenome analysis reference sequence database.[13,14]
LMAT indexes every k-mer (k=20) in a reference genome database with all of the sequences containing
that k-mer. It implements a “pruning” strategy to retain only higher level taxonomic labels for k-mers
shared by more than some pre-specified number of sequences down a taxonomic branch. It still retains
multiple taxonomic nodes and genomes per k-mer, and thus more complete information about which
sequences contain a k-mer than is possible with an LCA approach, which stores only a single taxonomic
node per k-mer. This allows higher resolution (e.g. species and strain) calls when the data warrants. The
full database (LMAT-Grand) requires 500 GB of DRAM or flash memory so is not feasible for “desktop” or
typical cluster users. LMAT-Grand’s extensive representation of genomic variation leads to labeling a
large fraction of reads, which is useful for some applications (such as read binning for assembly) but is
more computationally costly than may be needed for organism identification. The LMAT Marker Library
(ML) reduces the RAM requirements by preselecting only the most taxonomically informative and
non-overlapping (i.e. non-redundant) 20-mers for indexing, and by imposing more stringent pruning. Thus,
memory requirements of LMAT-ML are reduced not by limiting the taxonomic coverage and strain
resolution of the reference database, but by pre-selecting the subset of k-mers with the highest
taxonomic information content. Moreover, the marker library approach has the potential to run at least
an order of magnitude faster by correctly “ignoring” the less taxonomically informative portions of the
query set. Part of the work we present here evaluated several LMAT-ML pruning levels to determine the
optimal balance between memory, speed, and consistency of results compared to LMAT-Grand. We also
compared an LMAT-ML that contained only microbial k-mers to LMAT-ML+H that also contained all the
human k-mers in LMAT-Grand.
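The indexing idea described above, in which each k-mer is stored together with every sequence that contains it, can be illustrated with a minimal sketch. This is not LMAT's implementation: the function name, the toy sequences, and the small k are hypothetical, and the sketch ignores reverse complements and any pruning.

```python
from collections import defaultdict

def build_kmer_index(sequences, k=20):
    """Map every k-mer to the set of sequence ids that contain it.

    `sequences` is a dict of {sequence_id: DNA string}. Windows that
    contain anything other than A, C, G, T are skipped.
    """
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                index[kmer].add(seq_id)
    return index

# Toy example with a short k so the overlap is visible by eye.
toy = {"genomeA": "ACGTACGTAC", "genomeB": "CGTACGTACG"}
idx = build_kmer_index(toy, k=8)
shared = sorted(kmer for kmer, ids in idx.items() if len(ids) > 1)
```

A k-mer held by a single genome points to that genome alone, while a k-mer present in several genomes keeps all of them, which is the extra information an LCA-only index discards.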
Each of the above methods makes different tradeoffs in terms of memory, speed, sensitivity, and
accuracy. We chose to compare these marker library methods using actual microbiome data
for several reasons. Previous studies with simulated data sets ensure that the simulated reads come
from organisms or close relatives present in the reference database, which cannot be assumed in real
samples. Constructing realistic, robust simulated metagenomic benchmarking data sets remains a
fundamental challenge and will benefit from emerging community efforts to construct resources
available for third party validation. While these resources develop, our goal was to compare tools
that could be run with 16 GB of RAM or less on an important target subset of metagenomics: human
metagenomics, using 131 Human Microbiome Project (HMP; http://hmpdacc.org/HMASM/) samples
randomly selected to span body sites and genders, and check discordant results between methods with
BLAST. These samples may contain many species not included in NCBI RefSeq
(http://www.ncbi.nlm.nih.gov/refseq/), or represented only as draft genomes, as well as some that have
not yet been sequenced. Although we do not know ground truth, we looked at cases of disagreement
between the LMAT-Grand method that queries the most comprehensive reference database and each of
the marker library based methods. We examined the reads responsible for discordant species calls using
BLAST [15] searches to assess if they were most likely false negatives for one method or false positives
for the other. We report on the speed and accuracy of these metagenome analysis tools for HMP
samples. Our evaluation considers a hardware configuration of 16 GB DRAM and a low cost commodity
NVRAM in the form of a solid-state drive (SSD), which should be accessible for use with existing desktop
computers. The LMAT-ML databases range from approximately 13 GB to 19 GB in required storage; thus,
this range covers both fitting in and exceeding the available DRAM. This range allows us to measure the
impact of database size on LMAT classification performance.
Materials and Methods
Building LMAT-ML
The LMAT reference genome database includes 1) eukaryotic sequence of fungi, protozoa, and some
multicellular organisms (from organelles labeled as whole genome, e.g. mitochondria and chloroplasts);
2) draft genomes and assembled contigs from unfinished Whole Genome Shotgun (WGS) genome
sequencing projects (ftp://ftp.ncbi.nih.gov/genbank/wgs/); and 3) draft and finished bacteria, virus,
archaea, fungi, and protozoa genomes from a number of sequencing centers worldwide with publicly
available sequence data in addition to those from NCBI (footnote 1). An extensive collection of artificial vector
sequence [17] is also included to filter out contaminating sequence in draft assemblies. The fungi and
protozoa sequence data came from the Fungi and Protist Group BioProjects reported in the NCBI
eukaryotes genome report (ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/eukaryotes.txt),
and BioProject sequences were extracted by the assembly accession.
Footnote 1: Sanger Center, J. Craig Venter Institute, Baylor College of Medicine, Washington University in
St. Louis, Beijing Genome Institute, Integrated Microbial Genomes, European Molecular Biology
Laboratory, etc.
To select the k-mers for inclusion in LMAT-ML, we used the procedure described in [14], using k=20
instead of 18, and adding an extra step to remove overlapping k-mers relative to a reference sequence
(randomly selected from those containing a series of adjacent k-mers). Briefly, the objective was to
identify a collection of k-mers that are uniquely associated with phylogenetically distinct sets of genomic
sequence. Groups of genomes are defined by their shared k-mers with a minimum of 200 k-mers shared
within a group for viral genomes and 1000 k-mers shared within a non-viral group. The minimum
thresholds were set to maintain groups of genomes that retain some degree of phylogenetic
relatedness. Any k-mer found in more than one group was eliminated to yield a set of k-mers that are
uniquely associated with different levels of the taxonomy hierarchy. Additionally, k-mers matching
RepBase 18.06 [16] were eliminated. Since the resulting k-mer set still yielded a database with a larger
memory footprint than would be practical for a desktop or laptop, k-mers were mapped to a randomly
selected representative sequence from the genome group to remove multiple adjacent overlapping
k-mers. Next, synthetic k-mers from LMAT-Grand were added to detect synthetic/vector sequences.
Finally, a “+H” version of the LMAT-MLs was created to use the human k-mers from LMAT-Grand to
accurately classify human host sequences, while maintaining small memory requirements.
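One step of the selection above, eliminating any k-mer found in more than one genome group, can be sketched as follows. This is only an illustration of that single filter under simplifying assumptions: the group names and the short toy k-mers are hypothetical, and the group construction, RepBase filtering, and overlap removal described in the text are not modeled.

```python
from collections import Counter

def group_specific_kmers(group_kmers):
    """Keep, for each group, only the k-mers that occur in exactly one group.

    `group_kmers` is a dict of {group_name: iterable of k-mers}.
    """
    counts = Counter()
    for kmers in group_kmers.values():
        counts.update(set(kmers))  # count each k-mer once per group
    return {
        group: {kmer for kmer in set(kmers) if counts[kmer] == 1}
        for group, kmers in group_kmers.items()
    }

# Hypothetical groups with 4-mers standing in for 20-mers.
groups = {
    "group1": {"AAAA", "CCCC", "GGGG"},
    "group2": {"CCCC", "TTTT"},
    "group3": {"ACGT"},
}
markers = group_specific_kmers(groups)
```

In this toy input, `CCCC` is dropped from both groups that contain it, so every surviving k-mer identifies exactly one group.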
We evaluated two taxonomy pruning strategies to improve speed and reduce the memory
requirements, as described in detail in [13]. Evaluating these pruning strategies is necessary to assess
the possible cost in lost accuracy while improving speed. Such an evaluation is an important step in
understanding the potential for LMAT-ML since the pruning strategy is a unique property of LMAT’s
approach to classification, in contrast to other tools. As a baseline, when no pruning is used, indicated
as “-All”, every taxonomy identifier of lowest available rank (e.g. species or strain) for a given k-mer is
retained in the searchable database. The -Min pruning option (LMAT-ML-Min and LMAT-ML+H-Min)
stores only the lowest common ancestor (LCA) for each k-mer, similar to the approach employed by
Kraken.
The alternative pruning option tested stores a maximum of 10 taxonomy identifiers per k-mer (LMAT-ML
and LMAT-ML+H). LMAT databases with the “+H” label include all human k-mers in addition to the
microbial k-mers. In this option each k-mer is linked to a set of taxonomic identifiers that contains the
lowest common ancestor for all sequences containing the k-mer and up to 9 descendant identifiers,
which must retain a common rank. For example, if the LCA is of rank genus, and the k-mer is found in
nine or fewer distinct species, then all distinct species identifiers would be retained. If the k-mer is
found in more than nine species (but only one genus) then only the genus identifier would be retained.
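The pruning rule in the example above can be written as a short sketch. It is an illustration under stated assumptions, not LMAT's code: the toy taxonomy, the identifiers, and the function names are hypothetical, and all input identifiers are assumed to share one rank (species here). How intermediate ranks are handled when the LCA lies more than one level above the input is not covered.

```python
def lineage(taxid, parent):
    """Return the path from `taxid` up to the root, inclusive."""
    path = [taxid]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path

def lowest_common_ancestor(taxids, parent):
    """Return the deepest node shared by the lineages of all `taxids`."""
    paths = [lineage(t, parent) for t in taxids]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking up from the first taxid, the first shared node is the LCA.
    return next(node for node in paths[0] if node in common)

def prune(taxids, parent, max_ids=10):
    """Keep the LCA plus the same-rank descendants if they fit in `max_ids`."""
    distinct = sorted(set(taxids))
    lca = lowest_common_ancestor(distinct, parent)
    if len(distinct) == 1:
        return [lca]
    if len(distinct) <= max_ids - 1:
        return [lca] + distinct
    return [lca]  # too many descendants: fall back to the LCA alone

# Hypothetical taxonomy: one genus "G" with twelve species s1..s12.
parent = {"root": None, "G": "root"}
parent.update({f"s{i}": "G" for i in range(1, 13)})

few = prune(["s1", "s2", "s3"], parent)
many = prune([f"s{i}" for i in range(1, 11)], parent)
lca_only = prune(["s1", "s2", "s3"], parent, max_ids=1)
```

Setting `max_ids=1` reproduces the LCA-only behavior of the -Min option, while the default of 10 corresponds to the LCA plus up to 9 descendants.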
The LMAT database uses a two-level index data structure described in detail in [19] to improve the
efficiency of k-mer search. A k-mer is represented by two non-overlapping bit vectors, with a 20-mer
represented by 40 bits and the first N bits stored in the first level of the index and the remaining 40-N
lower order bits stored in the second level. The choice of N was optimized for use with the ML. A split of
N=25 bits was selected, which reduced the size of the ML database by 1.5 GB, compared with previous
settings developed for use with the full database. A new extension to the LMAT software was added (v
1.2.5), enabling the two-level split to be specified for the target database. This feature allows each
database to be uniquely tuned for efficient use of space, adjusting for the size of the database. The split
parameter was adjusted from the previous setting used for the larger database (LMAT-Grand) to deploy
the smaller marker databases on a low cost SSD device.
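The two-level key split can be sketched as below. The 40-bit width and N=25 come from the text; the particular 2-bit base encoding (A=0, C=1, G=2, T=3) and the function names are assumptions made for illustration, and the actual index layout is the one described in [19].

```python
# Assumed 2-bit encoding; the source only states that a 20-mer takes 40 bits.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_kmer(kmer):
    """Pack a DNA k-mer into an integer using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

def split_key(value, k=20, n=25):
    """Split a 2k-bit key into the first n bits and the remaining 2k-n bits."""
    low_bits = 2 * k - n
    return value >> low_bits, value & ((1 << low_bits) - 1)

def join_key(high, low, k=20, n=25):
    """Inverse of split_key."""
    return (high << (2 * k - n)) | low

key = encode_kmer("ACGT" * 5)     # a 20-mer, so a 40-bit key
high, low = split_key(key)        # 25 high-order bits, 15 low-order bits
```

With N=25 the first level can address up to 2^25 buckets and each second-level entry needs only the remaining 15 bits, which is why the choice of N trades first-level table size against per-entry storage.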
Marker library comparisons
We used a set of 131 HMP samples randomly chosen to span all body sites for both genders (Additional
file 1: 131HMP_Samples.xlsx), with preprocessing to trim non-biological portions of reads (adaptors),
trim or replace low quality bases (Q<10) with N, and combine paired end sequences as described in
[14]. We compared results from Clinical Pathoscope v1.0.3, Metaphlan2 (db_v20), GOTTCHA
(downloaded Sept. 2, 2014), MiniKraken (kraken-0.10.4-beta), SIANN (v1.12), the LMAT-ML+H, and the
LMAT-Grand database. Each method was run with default parameters. Clinical Pathoscope target
databases were bacteria and virus, and host filtration database was human. GOTTCHA was run against
the GOTTCHA_BACTERIA_c3514_k24_u24_xHUMAN3x.species and
GOTTCHA_VIRUSES_c3498_k85_u24_xHUMAN3x.species databases, microbial databases in which
24-mers matching the human reference genome had been removed, and results were combined for
bacteria and viruses. Results from all LMAT-MLs with no or moderate pruning were nearly identical in
terms of consistency with the LMAT-Grand database, while results were slightly worse for LMAT-ML-Min
without human k-mers (Figure S1), so for all the marker method comparisons we used LMAT-ML+H. We
uniformly applied the default cutoff values of 0.5 for LMAT-Grand and 0.2 for LMAT-ML+H. We required
a minimum of 100 reads per species to call that species as present for each of the methods, and
observed that results were similar across thresholds of 50-2000 reads (results not shown). Reads for
methods that reported calls as percentages of mapped or total reads were converted to absolute
numbers of reads. For Metaphlan2, since the relative abundance was not straightforward to convert to
read counts, we used the percent abundance multiplied by the total number of reads mapped as a
proxy. We identified the NCBI species-level taxonomy id and NCBI species name for all species and strain
calls by each method, a nontrivial process since some methods report non-standard species names and
no taxonomy identifiers, and use deprecated GI numbers, which are not in the current NCBI databases.
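The read-count proxy and the presence threshold described above amount to simple arithmetic, sketched here. The function names, species labels, and numbers are hypothetical; only the 100-read threshold and the percent-times-mapped-reads proxy come from the text.

```python
def read_count_proxy(percent_abundance, mapped_reads):
    """Approximate read count from a percent abundance and total mapped reads."""
    return percent_abundance / 100.0 * mapped_reads

def species_present(read_counts, min_reads=100):
    """Return the species whose (proxy) read count meets the threshold."""
    return {species for species, n in read_counts.items() if n >= min_reads}

# Hypothetical sample: 200,000 mapped reads and two reported abundances.
abundances = {"species A": 2.5, "species B": 0.01}
counts = {sp: read_count_proxy(pct, 200_000) for sp, pct in abundances.items()}
present = species_present(counts)
```

Here species A corresponds to roughly 5,000 reads and is called present, while species B corresponds to roughly 20 reads and falls below the 100-read threshold.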
For each method, the 10 most commonly called species detected by that method and not by LMAT-Grand
were identified, and the reads identified by that method as belonging to that species were
extracted based on parsing the bowtie2 or SAM output, identifying the taxonomy by GI number, NCBI
gene ID (Metaphlan2), or as a last resort by organism name, using NCBI tables (footnote 2). We acknowledge that
we may have missed extracting some reads which a given method used for classification, since these
methods do not describe standardized procedures for read extraction and call validation. For
Metaphlan2 in particular, we were unable to extract as many reads as we expected based on reported
percentage abundances. Reads were combined into a single file per species for each method. These
reads were then compared using BLAST (blastn -evalue 0.0001 -max_target_seqs 5) to our
comprehensive database of all bacterial and viral sequences, and the output pruned to show only the
matches to each read with the lowest e-values, allowing multiple matches per read with the same
lowest e-value. These reads were also compared using BLAST to a comprehensive database of vectors
and other artificial sequences, containing the sequences in UniVec as well as Illumina adaptors and a
number of commercially available vector sequences [17]. The reads were also compared using LMAT
with the Grand database, and the taxonomic call with highest read count was reported.
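The pruning of BLAST output to the lowest e-value matches per read, keeping ties, can be sketched as follows. The input is assumed to be already parsed into (read, subject, e-value) tuples, for example from a tabular BLAST report; the parsing step, the identifiers, and the function name are hypothetical and not taken from the authors' pipeline.

```python
from collections import defaultdict

def best_hits(hits):
    """Keep, for each read, every subject that ties for the lowest e-value.

    `hits` is an iterable of (read_id, subject_id, evalue) tuples.
    """
    by_read = defaultdict(list)
    for read_id, subject_id, evalue in hits:
        by_read[read_id].append((subject_id, evalue))
    best = {}
    for read_id, matches in by_read.items():
        lowest = min(evalue for _, evalue in matches)
        best[read_id] = [subject for subject, evalue in matches if evalue == lowest]
    return best

# Hypothetical hits: read1 has two subjects tied at the lowest e-value.
hits = [
    ("read1", "subjectA", 1e-30),
    ("read1", "subjectB", 1e-30),
    ("read1", "subjectC", 1e-10),
    ("read2", "subjectC", 1e-5),
]
pruned = best_hits(hits)
```

Keeping ties matters for the later analysis: a read whose best matches are spread over several species is evidence for a genus-level or higher call rather than a species-level one.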
The 10 most common species calls made by LMAT-Grand and not another method were gathered for
each method, ignoring Homo sapiens and LMAT-Grand classification for synthetic constructs, which
LMAT-Grand reports as a “species”. The top 10 list was identical for Clinical Pathoscope, GOTTCHA, and
MiniKraken, which all use NCBI RefSeq to build their reference databases. Reads for these species with
LMAT-Grand scores of at least 1 were extracted and compared using BLAST against our comprehensive
viral/bacterial genome database, and in one case where a protozoan was uniquely detected, against our
protozoa database.
Results
LMAT databases encode 11-147 times more sequence data representing approximately 2-4 times more
species than other methods (Table 1). The memory requirement (DRAM or DRAM+NVRAM) of LMAT-ML is also
higher than for other methods, although still feasible for a desktop, particularly when supplemented
with low cost NVRAM.
Footnote 2: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz,
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz,
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz
Method               Number of species in    Reference genome database   Memory required
                     reference databaseⱡ     size (Gbases)               (GBytes)
Clinical Pathoscope  4280                    7.8                         4.6†
Metaphlan2           7147                    0.76                        3†
GOTTCHA              3461                    Not available*              8®
MiniKraken           5538                    9.97                        4¥
LMAT-ML              12632                   112.10**                    16 + 24 NVRAM, or 24 DRAM
Table 1: Reference genome database size and number of species represented by each method.
ⱡ Species counts determined by counting the unique NCBI species identifiers from all sequences in the
reference database.
* Original reference sequence database was not provided with download, only the precomputed,
compiled marker library of k-mers.
** This does not include the additional k-mers added from the 1000 Human Genomes Project, since
these were from unassembled genomes as described in [13].
† Estimated using /usr/bin/time, correcting for the bug in GNU time
(http://stackoverflow.com/questions/10035232/maximum-resident-set-size-does-not-make-sense) by
dividing the maximum resident set size by 4.
¥ Reported by [9]
® Reported by https://github.com/poeli/GOTTCHA
Runtime performance is determined by database size
Table 2 shows the size of the different ML databases reflecting the different pruning strategies (shown
as 1, 10, and All in the third column of the table). Although there was little difference in taxonomic
classifications between 4 of the 5 LMAT-MLs (Figure S1), the choice of pruning level affects memory
requirements. For comparison, LMAT-MLs include approximately 4.4-4.7 times more k-mers than
MiniKraken but require 3-4.7 times more memory. However, LMAT-ML-Min uses fewer bytes
per k-mer than MiniKraken (7.6 versus 11.2), which is likely explained by LMAT’s use of a smaller k
(20 versus 32 for MiniKraken).
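The figures quoted above can be reproduced from the Table 2 sizes and k-mer counts as a quick check. The sketch assumes 1 GB = 10^9 bytes when dividing database size by k-mer count, which yields bytes per k-mer; the variable and function names are illustrative.

```python
GB = 1e9  # assumed decimal gigabyte

# (database size in GB, number of k-mers), as listed in Table 2
databases = {
    "MiniKraken": (4.0, 357_913_941),
    "LMAT-ML-Min": (12.1, 1_586_405_299),
    "LMAT-ML+H-All": (18.9, 1_697_066_355),
}

def bytes_per_kmer(size_gb, n_kmers):
    """Average storage per k-mer in bytes."""
    return size_gb * GB / n_kmers

mini_size, mini_kmers = databases["MiniKraken"]
per_kmer = {name: round(bytes_per_kmer(*v), 1) for name, v in databases.items()}
kmer_ratio = {name: round(v[1] / mini_kmers, 1) for name, v in databases.items()}
size_ratio = {name: round(v[0] / mini_size, 1) for name, v in databases.items()}
```

Under this assumption the calculation gives about 7.6 bytes per k-mer for LMAT-ML-Min and about 11.2 for MiniKraken, with k-mer count ratios of about 4.4 to 4.7 and size ratios of about 3.0 to 4.7 relative to MiniKraken.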
Method          Contains human   Maximum taxonomy ids   Size (GB)   K-mers
                k-mers?          stored per k-mer
MiniKraken      -                1                      4.0         357,913,941*
LMAT-ML-Min     no               1                      12.1        1,586,405,299
LMAT-ML         no               10                     15.9        1,586,405,299
LMAT-ML+H       yes              10                     16.8        1,697,066,355
LMAT-ML-All     no               All                    18.1        1,586,405,299
LMAT-ML+H-All   yes              All                    18.9        1,697,066,355
Table 2: Database sizes and k-mer counts (for LMAT) given configurations that vary the presence of
human reference k-mers and the maximum count of taxonomy ids listed per k-mer. * For comparison,
we include an estimate of MiniKraken’s k-mer count based on 12 bytes per k-mer as provided by the
authors.[9]
[Figure 1 plot: processing rate in bp/(s*cpu), axis from 0 to 1,000,000, for Metaphlan2, LMAT-ML-ramfs,
LMAT-ML-Min, LMAT-ML-Human-All, LMAT-ML-Human, LMAT-ML-All, LMAT-ML, Kraken, GOTTCHA, and
Clinical Pathoscope]
Figure 1: Processing rate for LMAT using 5 database configurations and four additional processing
methods, reported per CPU. 5 of the 6 LMAT-ML runs use SATA II SSD for storage. For LMAT-ML-ramfs, the
LMAT database is stored in main memory on a “ramdisk” (linux “ramfs”) filesystem. In this and the
following figures, MiniKraken is labeled simply as ‘Kraken’.
Figure 1 shows processing rates for LMAT compared with the four classification tools. Each boxplot bar
encompasses the results from 8 HMP sample runs. For the LMAT runs we have configured five marker
libraries, and query them using 16 GB of DRAM and a SATA II Ocz 500 GB solid-state drive. To compare
performance when storing the database completely in DRAM memory, we show the performance of the
LMAT-ML database on a compute platform with 24 GB of DRAM, where the database index can reside in
main memory without paging from the external storage device using a ramdisk (linux ramfs), as
indicated by label LMAT-ML-ramfs. From these results we make several observations.
(1) MiniKraken has approximately 5 times faster performance than our baseline LMAT (at most 10
taxonomy identifiers per k-mer). Its database size is considerably smaller at 4 GB in contrast to
the LMAT databases (Table 2); thus, it should have greater CPU cache hit rates on average,
which can have a profound impact on performance.
(2) LMAT-ML-Min (single taxonomy identifier per k-mer) shows faster performance than our
ramdisk run (LMAT-ML-ramfs). This identifies the increase in cost of processing more taxonomy
identifiers.
(3) LMAT-ML (SSD configuration) is within 4% of LMAT-ML-ramfs. While we estimate that as
much as 15% of the database would not be available in buffer cache at any moment, this result
indicates that this ratio of database size to total DRAM does not create a substantial burden on
the system’s caching mechanism.
(4) The LMAT-ML+H and both LMAT-ML-All databases show slower performance. In all cases, the
larger database size means less of it can fit in buffer cache, thus requiring more NVRAM access
operations. The addition of human k-mers appears to have a greater impact on performance than the
increase in taxonomy identifiers per k-mer, indicating that the number of distinct k-mers stored
in the index necessitates more NVRAM access operations.
(5) Metaphlan2 is faster than all LMAT-ML instances, but LMAT-ML-Min is within 20%.
(6) All LMAT configurations outperform GOTTCHA and Clinical Pathoscope in terms of speed,
although Clinical Pathoscope is close to the LMAT configurations that contain human reference
information.
Comparing the organisms detected by each method
SIANN found none of the organisms in its database for any of the HMP samples. We manually searched
for several pathogens that were included in its database and which were detected by both
Metaphlan2 and LMAT-Grand but not SIANN, and these included Clostridium symbiosum, Burkholderia
cenocepacia, Staphylococcus epidermidis, S. caprae, S. hominis, S. lugdunensis, and S. aureus. We ran SIANN on
several samples with known spiked pathogen concentrations [18] to verify that we were running the
software correctly. In a sample with 10,000 genome equivalents (GE) of Bacillus anthracis, Burkholderia
pseudomallei, Francisella tularensis, and Yersinia pestis and an unknown concentration of Brucella
spiked into a background of human DNA and sequenced on Ion Torrent, SIANN detected the 5 spiked
pathogen species with high confidence but was unable to make a correct strain call. It also made 2 false
positive species calls for organisms that were not present (Francisella novicida and Francisella cf). At 100
GE spike in, only an incorrect strain of F. tularensis was detected with low confidence, and none of the
other pathogens present were detected. LMAT-Grand and LMAT-ML+H detected all 5 pathogens at
10,000 and 100 GE. LMAT called the vast majority of reads at the species level (except for Brucella
genus), indicating these regions are conserved across multiple genomes. LMAT with 10,000 GE correctly
called Y. pestis Harbin as the top strain, and in the other cases the top strain was among the top 3
genomes with the most genome-specific reads. We did not investigate SIANN further, as it was not
designed to be a general purpose metagenomics analysis tool.
An important component of identifying the organisms present in a metagenome is measuring the
number of reads left unclassified. Even as microbial diversity in genomic databases continues to grow,
the potential for novel genes or organisms to remain ‘hidden’ in the sample remains high. Thus, the
ability to reduce the number of reads that must be considered for assembly or other more in-depth
analysis through time-consuming sensitive protein searches with BLAST or profile Hidden Markov
Models must be considered when evaluating the completeness of a method's taxonomic profiling.
LMAT-Grand classified on average 83% of the reads per sample, followed by LMAT-ML+H (63%),
MiniKraken (35%), Clinical Pathoscope (29%), GOTTCHA (14%), and Metaphlan2 (5%) (Figure 2A).
LMAT-Grand detected an average of 178 species/sample, LMAT-ML+H 154 species/sample, MiniKraken 108
species/sample, Clinical Pathoscope 67 species/sample, Metaphlan2 42 species/sample, and GOTTCHA
27 species/sample (Figure 2B).
Figure 2A: Box plots showing the fraction of reads that each method classified from the 131 HMP
samples.
Figure 2B: Box plots of the number of species that each method classified from each of the 131 HMP
samples. LMAT-Grand detected on average 11% more species than LMAT-ML+H and 57% more species
than MiniKraken, the next highest method. BLAST results suggest a large number of MiniKraken calls are
false positives.
Taxonomy calls made using LMAT-Grand do not represent a complete ground truth of organisms
present in a sample. Since the database draws on the most complete collection of sequenced genomes
among the databases considered, it is useful to measure the concordance of each of the other
methods with LMAT-Grand. The overlap between each method and LMAT-Grand differed
substantially as illustrated in Figure 3, summing species calls in agreement or in disagreement with those
of LMAT-Grand across the 131 samples. MiniKraken and Clinical Pathoscope showed relatively little
overlap with LMAT-Grand species calls, GOTTCHA and Metaphlan2 overlapped by the majority of their
species calls but only covered a small fraction of the LMAT-Grand calls, and LMAT-ML+H and LMAT-Grand
agreed almost entirely.
Figure 3: Venn diagrams illustrating the total number and overlap of species calls for the 131 HMP
samples by each method with LMAT-Grand, in order of increasing overlap.
The calls by GOTTCHA, MiniKraken, Metaphlan2, and Clinical Pathoscope share higher similarity with
those of LMAT-Grand at the genus level than at the species level, although neither species nor genus
agreement is very high (Figure 4). To consider the possibility that the LMAT-Grand output was error
prone, the discordant calls were examined in detail using BLAST alignments against a
comprehensive microbial database.
Figure 4: The fraction of species (circles) or genera (crosses) detected by LMAT-Grand that were also
detected by a given method versus the fraction of species detected by a given method that were not
detected by LMAT-Grand, averaged across the 131 HMP samples. If the calls by LMAT-Grand are correct
as indicated by BLAST results (see below), then this is akin to sensitivity versus false discovery rate.
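The two quantities plotted in Figure 4 can be expressed with set operations, as in the sketch below. The function names and the toy species labels are hypothetical; treating the reference calls as correct is the same conditional assumption stated in the caption.

```python
def concordance(reference_calls, method_calls):
    """Return (fraction of reference calls recovered, fraction of method calls unsupported)."""
    reference_calls, method_calls = set(reference_calls), set(method_calls)
    recovered = (
        len(reference_calls & method_calls) / len(reference_calls)
        if reference_calls else 0.0
    )
    unsupported = (
        len(method_calls - reference_calls) / len(method_calls)
        if method_calls else 0.0
    )
    return recovered, unsupported

def mean_concordance(samples):
    """Average both fractions over a non-empty list of (reference, method) call sets."""
    pairs = [concordance(ref, meth) for ref, meth in samples]
    n = len(pairs)
    return sum(p[0] for p in pairs) / n, sum(p[1] for p in pairs) / n

# Two hypothetical samples with made-up species labels.
samples = [
    ({"sp1", "sp2", "sp3", "sp4"}, {"sp1", "sp2", "sp9"}),
    ({"sp1", "sp2"}, {"sp1", "sp2"}),
]
recovered, unsupported = mean_concordance(samples)
```

If the reference calls are taken as correct, the first value plays the role of sensitivity and the second the role of false discovery rate.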
Calls made by other methods that were not supported by LMAT-Grand
All the marker library methods except LMAT-ML detected a number of species in each sample that
LMAT-Grand did not (Figures 3 and 4). Reads were extracted from a given method’s mapping results for
the 10 most commonly occurring species calls that differed from LMAT-Grand. Figure 5 shows the sum
of these extracted reads across samples and species, totaling between 10,000 and more than 1M reads
for non-LMAT methods. With orders of magnitude fewer reads for LMAT-ML+H potential false positives,
manual inspection revealed that the difference between LMAT-ML+H and LMAT-Grand lay in minor
quantitative differences close to the threshold read count for calling a species as present rather than
qualitatively different calls. BLAST searches with these reads against the comprehensive microbial
genome database provided evidence in support or contradiction of the potential false positives,
summarized in Figure 6 and described in detail for each method.
[Figure 5 plot: number of reads for the top 10 most commonly called species not detected by LMAT-Grand,
axis ticks from 1,000 to 10,000,000, for Kraken, Clinical Pathoscope, GOTTCHA, Metaphlan2, and LMAT-ML+H]
Figure 5: Number of candidate false positive reads totaled across samples for the top 10 most common
species calls that were not detected by LMAT-Grand.
Figure 6: Fraction of best BLAST matches to species called by the indicated method that were discordant
with species calls by LMAT-Grand.
These were for the most commonly called species by the indicated method that were not detected by
LMAT-Grand in those same samples. In all non-LMAT methods, only a minority of BLAST hits to a
comprehensive sequence database supported the species called by that method, suggesting that these
were incorrect species calls. Calls by LMAT-ML+H had a majority of matches to the correct species, on
average, but nearly as many matches to other species in the genus, supporting genus level calls for some
of these reads. BLAST results provided strong support for LMAT-Grand calls that were not detected by
other methods. A more detailed analysis of the discordant calls for each method is given in the
following sections.
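The support measure discussed above, the share of best BLAST matches that agree with the species a method called, can be sketched as follows. Counting every tied best match as one hit is an assumption of this sketch, and the function name, read identifiers, and species labels are hypothetical.

```python
def support_fraction(best_hit_species, called_species):
    """Fraction of best BLAST matches whose species equals the called species.

    `best_hit_species` maps each read id to the list of species among its
    tied lowest e-value matches.
    """
    total = sum(len(species) for species in best_hit_species.values())
    if total == 0:
        return 0.0
    matching = sum(
        1
        for species in best_hit_species.values()
        for name in species
        if name == called_species
    )
    return matching / total

# Hypothetical reads that a method assigned to "species A".
best = {
    "read1": ["species A", "species B"],
    "read2": ["species A"],
    "read3": ["species C"],
}
support = support_fraction(best, "species A")
```

In this toy case half of the best matches support the call; values far below one half, as reported for most non-LMAT calls, indicate reads that match other species at least as well.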
MiniKraken unsupported calls
MiniKraken made a number of unsupported calls that were repeatedly seen across half or more of the
samples. For the most common MiniKraken species calls that were not consistent with LMAT species
calls, 7 of the 10 had 5% or fewer BLAST hits to the species identified by MiniKraken. The other 3 still
had a minority with 19-38% of BLAST matches to the MiniKraken species, and the most common LMAT
classification for these reads was at the genus level or above. The unsupported MiniKraken species call
with the bulk of reads, over 2 million, had 99.5% of reads matching the database of synthetic sequences.
These BLAST results indicate that MiniKraken calls were incorrect and/or overly specific, and
failed to correct for contamination of reference sequences with vector and other artificial sequence
contaminants (Supplementary Table S2).
Clinical Pathoscope unsupported calls
Clinical Pathoscope commonly called a number of bacteria in nearly half the samples which were not
supported by LMAT-Grand calls. Of reads mapped to organisms called by Clinical Pathoscope but not
LMAT-Grand, most had fewer than 1% of BLAST hits to this organism, suggesting that organisms with the
highest similarity to these reads were not in the Clinical Pathoscope database, so the reads were
mis-attributed to other species with homology (Supplementary Table S3). There were two exceptions for
which more than 10% of the lowest e-value BLAST matches were to the organism called by Clinical
Pathoscope: Propionibacterium acnes with 28%, and Bacteroides vulgatus with 17%, although these are
still a minority of BLAST matches. LMAT-Grand classified most of those P. acnes reads as
genus Propionibacterium, order Actinomycetales, class Actinobacteria, and cellular organisms.
LMAT-Grand called the majority of those B. vulgatus reads at the Bacteroides genus level, in addition to some
thousands of reads to other genera, order Bacteroidales, phylum Bacteroidetes, and superkingdom
Bacteria. The fact that a minority of BLAST hits are to the organism identified by Clinical Pathoscope
suggests that calls by Clinical Pathoscope may be overly specific, with equivalent similarity to many
species. In addition, up to 19% of the reads for some of these unsupported Clinical Pathoscope calls had
BLAST matches to artificial sequences, indicating that this method fails to adequately control for
contamination of the reference database by artificial sequences such as vectors and adaptors.
GOTTCHA unsupported calls
Phage dominated the calls unique to GOTTCHA. All the reads which GOTTCHA labeled as Enterobacteria
phage lambda matched our vector and other synthetic sequences database, and LMAT labeled almost all
these reads as synthetic constructs or root (Supplementary Table S4). 23% of the reads that GOTTCHA
called as Enterobacteria phage phiX174 sensu lato had BLAST matches to our vector database, and LMAT
labeled them as root, superkingdom Bacteria, synthetic sequences, or cellular organisms. Elements of
these phage are used for genetic engineering and as controls in certain Illumina sequencing protocols,
so we include them in the vector/synthetic database, and this result is not surprising. While the other
calls did not have many BLAST matches to artificial sequences, only a minority of the top BLAST hits
matched the GOTTCHA call, since these reads were widely conserved across species or higher ranks, and
LMAT identified these reads largely as root or genus level calls.
Metaphlan2 unsupported calls
There were 4 organisms commonly called by Metaphlan2 that were unsupported by LMAT-Grand calls:
Dasheen mosaic virus, Streptococcus sp. GMD4S, Catonella morbi, and Abiotrophia defectiva, plus some
other unsupported calls that occurred in just a few samples (Supplementary Table S5). In contrast to
Clinical Pathoscope, Metaphlan2 unique calls had very few matches to artificial sequence. Reads that
Metaphlan2 uniquely labeled as Dasheen mosaic virus and Vicia cryptic virus had no best BLAST matches
to these viruses, prompting us to double check that these viruses were indeed present in our BLAST
database. LMAT calls for these reads were predominantly to Homo sapiens, plus small numbers to high
level classifications such as taxonomy root node, cellular organisms, and various bacterial genera. For
some of the organisms, 48%-68% of the BLAST matches were to the correct organism, but there were as
many matches of the same quality to other organisms, suggesting overly specific Metaphlan2 calls to
reads conserved at the genus or phylum level. Supporting this observation, LMAT calls for those
particular reads were overwhelmingly at the phylum and genus levels, and even some to the kingdom
Bacteria and to cellular organisms. For other Metaphlan2 species calls, LMAT-Grand classified more of
those reads to several near neighbor species in the same genus, with only a small subset of those reads
classified as the same species as Metaphlan2 but not enough to reach the minimum call threshold of
100 reads in those samples.
LMAT-ML+H unsupported calls
Unsupported calls were less common and less consistent across multiple samples for LMAT-ML+H than
for the other methods, and involved fewer reads (Figure 5). When we investigated LMAT-ML+H
differences from the Grand database, most of the differences were in samples where the number of
reads called hovered around the 100 read threshold, so that species called by LMAT-ML+H were also
observed in the Grand results but at abundances just below 100 reads (Table 3). LMAT-Grand calls for
the extracted species reads from the LMAT-ML+H runs always included some species calls to the species
detected by LMAT-ML+H for those reads, as well as a larger number at a higher taxonomy level. Manual
inspection of many examples showed that both LMAT-ML+H and LMAT-Grand always called multiple
species in the genus, although the distribution of read counts among those species differed. The
average score per species was usually lower for LMAT-ML+H than Grand. LMAT calculates a log odds
read score from the number of k-mer matches in a read relative to a null model simulated with random
sequences for each database, adjusting for GC content and read length of the null model to match that
of the read [14]. The LMAT-ML+H database has many fewer k-mers than the Grand database, resulting in
lower average scores. Low scores indicate that an organism has best similarity to that taxonomic call
but does not perfectly match the reference sequence, suggesting a novel variant. The calls to
Tetrahymena thermophila had average read scores of just above 0.2 for both LMAT-ML+H and Grand, and
so were below the threshold for Grand but not LMAT-ML+H. These reads had BLAST matches only to
Plasmodium yoelii and Tetrahymena thermophila, and were very repetitive and AT rich. Classification of
such low complexity reads is a challenge, especially considering the difficulty of assembling reference
genomes for protozoa around such regions, with the potential for misassembly and contamination with
host and other eukaryotic DNA. In summary, differences between LMAT-Grand and LMAT-ML+H were
more likely for organisms with low numbers of reads or low scores, i.e. those with low abundance or low
similarity to that genome in the database.
Species                           Number of samples      Fraction of BLAST   Fraction of reads
                                  with unsupported call  matches to called   with matches to
                                                         species             synthetic constructs
Streptococcus_sp._M143            21                     0.506               0                     genus,Streptococcus
Streptococcus_sp._GMD4S           20                     0.164               0                     genus,Streptococcus
Streptococcus_sp._oral_taxon_071  15                     0.279               0                     genus,Streptococcus
Streptococcus_sp._AS14            13                     0.765               0                     genus,Streptococcus
Lachnospiraceae_bacterium_MSX33   12                     0.607               0                     order,Clostridiales
Streptococcus_sp._GMD6S           12                     0.333               0                     genus,Streptococcus
Bacteroides_vulgatus              11                     0.852               0                     genus,Bacteroides
Tetrahymena_thermophila           11                     0.5*                0                     norank,cellular organisms
Streptococcus_pseudopneumoniae    11                     0.658               0                     genus,Bacteroides
Streptococcus_mitis               10                     0.852               0                     genus,Streptococcus
Table 3: BLAST analysis of unsupported calls by LMAT-ML
* BLASTed against a database of protozoa sequences as well as bacterial and viral, and using -dust no -word_size 20 options to BLAST in addition to the options described in methods
Calls made by LMAT-Grand that were not detected by other methods
LMAT-Grand calls had strong BLAST support: the LMAT-identified species dominated the BLAST matches.
This supports the notion that these are false negatives by GOTTCHA, MiniKraken, Clinical Pathoscope,
and Metaphlan2, all of which use substantially smaller reference databases than LMAT-Grand (Table 4).
These were not isolated false negatives, as most occurred in over half the HMP samples we compared.
In contrast, for LMAT-ML+H, even the most common false negatives occurred in less than a third of the
samples. For the LMAT-ML+H comparisons, in every case we examined, the species was actually
detected by LMAT-ML+H but it fell under the 100 read count threshold, so it was a matter of slight
quantitative differences near the threshold rather than qualitative differences, the same result as
discussed above for the LMAT-ML+H unique calls. This suggests that when using LMAT-ML+H,
adjustments should be made to account for the lower sensitivity of LMAT-ML+H relative to LMAT-Grand.
LMAT-ML+H uses the identical database of reference genomes as LMAT-Grand, downselecting to a fraction
of the most taxonomically informative k-mers and removing redundancy by eliminating many of the
overlapping k-mers, so it is not surprising that differences between the two databases are small.
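The distinction drawn above, between a species that is truly absent from one method's output and one that was detected but fell under the 100 read call threshold, can be sketched as follows. The function names and the per-sample dictionaries of read counts are hypothetical, and treating the threshold as "at least 100 reads" is an assumption.

```python
def called_species(read_counts, min_reads=100):
    """Species whose read count reaches the minimum call threshold."""
    return {sp for sp, n in read_counts.items() if n >= min_reads}

def compare_calls(counts_a, counts_b, min_reads=100):
    """Split species called by method A but not by method B into two groups:
    those B detected below the threshold, and those B did not detect at all."""
    only_a = called_species(counts_a, min_reads) - called_species(counts_b, min_reads)
    sub_threshold = {sp for sp in only_a if counts_b.get(sp, 0) > 0}
    absent = only_a - sub_threshold
    return sub_threshold, absent

# Toy read counts for one sample (made-up numbers, not from the paper).
grand = {"species_1": 150, "species_2": 120, "species_3": 500}
marker = {"species_1": 95, "species_3": 480}
sub_threshold, absent = compare_calls(grand, marker)
print(sorted(sub_threshold), sorted(absent))
```

In this toy sample, species_1 is a quantitative difference near the threshold, while species_2 is a qualitative miss.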
Species                                 Number of samples   Method*     Number of reads,   Fraction of BLAST
                                        with call by LMAT               summed across      matches to LMAT
                                        Grand but not                   samples            Grand called
                                        other method                                       species
Veillonella_dispar                      83                  CP,G,K      127783             0.949
Streptococcus_infantis                  81                  CP,G,K      777190             0.956
Granulicatella_adiacens                 81                  CP,G,K      37017              0.927
Gemella_haemolysans                     80                  CP,G,K      1099653            0.995
Porphyromonas_sp._oral_taxon_279        78                  CP,G,K      13566              0.931
Prevotella_oris                         77                  CP,G,K      605452             0.998
Prevotella_salivae                      77                  CP,G,K      4050496            0.999
Leptotrichia_wadei                      77                  CP,G,K      3447770            0.680
Fusobacterium_periodonticum             76                  CP,G,K      11218643           0.992
Lachnoanaerobaculum_saburreum           76                  CP,G,K      823744             0.833
Streptococcus_oralis                    78                  M           409051             0.935
Tannerella_sp._oral_taxon_BU063         73                  M           3267593            1.000
Streptococcus_pneumoniae                73                  M           410193             0.957
Streptococcus_agalactiae                70                  M           97079              0.986
Prevotella_sp._ICM33                    70                  M           1433785            0.940
Veillonella_sp._oral_taxon_158          68                  M           8530               0.878
Neisseria_mucosa                        68                  M           277668             0.759
Prevotella_sp._F0091                    68                  M           411478             0.919
Actinomyces_sp._oral_taxon_175          62                  M           14421              0.765
Streptococcus_sp._M334                  62                  M           456562             0.876
Streptococcus_suis                      37                  LMAT-ML+H   1820               0.917
Candidatus_Saccharimonas_aalborgensis   31                  LMAT-ML+H   8636               0.273
Prevotella_sp._HJM029                   26                  LMAT-ML+H   4381               0.917
Staphylococcus_sp._DORA_6_22            26                  LMAT-ML+H   8839               0.921
Streptococcus_sp._I-P16                 26                  LMAT-ML+H   7223               0.896
Lactococcus_lactis                      25                  LMAT-ML+H   690                0.991
Aggregatibacter_actinomycetemcomitans   23                  LMAT-ML+H   1095               0.883
Actinomyces_urogenitalis                22                  LMAT-ML+H   448                0.616
Streptococcus_sp._oral_taxon_056        21                  LMAT-ML+H   122                0.656
Streptococcus_sp._HSISS2                20                  LMAT-ML+H   4621               0.929
Table 4: BLAST analysis of calls made by LMAT Grand but not other methods.
* CP = Clinical Pathoscope, G = GOTTCHA, K = Kraken, M = Metaphlan2
Detecting Eukaryotes
The LMAT databases (all variants) are the only reference databases among these metagenomic analysis
tools that include eukaryotic sequences in addition to human. As a result, we were able to classify reads
matching fungi, protozoa, plants, and animals (Table S1). The majority of eukaryotic reads were human,
followed by the fungal subkingdom Dikarya, in which many genera and species were detected. Reads were
detected from dog (Canis, Canis lupus, or Canis lupus domesticus; 3 samples with 49-578 reads each).
The fungal genus Malassezia had particularly large numbers of reads in many samples, an expected
result for a genus naturally found on the skin of many animals. Small numbers of reads for various
pathogenic protozoa were detected, including the apicomplexans Plasmodium, Eimeria (coccidiosis in
livestock), Hammondia hammondi, and Toxoplasma gondii, as well as the amoeba Acanthamoeba
castellanii. Five samples contained unusually high numbers of reads (~20,000) of Hammondia hammondi,
which relies on cats as its definitive host. One stool sample had ~2,700 reads of Blastocystis hominis, a
gastrointestinal parasite of disputed pathogenicity. Several samples contained 1000-3000 reads of
Entamoeba nuttalli, known to cause illness in non-human primates. The freshwater ciliated protozoan
Tetrahymena thermophila was detected in a number of samples, most often with low read count, except
a few cases with thousands of reads. One retroauricular crease sample had ~9,500 reads classified as
Trypanosoma cruzi (Chagas disease), which can persist unnoticed in the host for
decades, although detection on skin may not mean the host is infected. Small numbers of reads of
Trichomonas vaginalis were detected in a handful of samples. While the bulk of calls were bacterial,
these observations suggest that further study confirming the presence of eukaryote reads in the HMP
data is warranted. Many draft eukaryotic sequences appear to contain misassembled human and
vector/synthetic sequence, and we have invested substantial effort to control and correct for this
contamination [14]. We conducted spot checks with BLAST on reads assigned to eukaryotes (including
Malassezia and Canis lupus) to confirm that obvious misassignment errors were not present.
Nevertheless, non-human eukaryote-classified reads deserve an extra measure of caution, such as
demanding higher read count or score thresholds for calling a species as present. LMAT-ML+H is less
sensitive than LMAT-Grand for these organisms, as manual comparisons of several of these species
showed fewer reads and lower scores in the ML, but it is still capable of detecting these organisms using
16-24 GB of memory.
Discussion
Sensitivity was too low for SIANN to be feasible for clinical samples, and false positives were seen in
samples with pathogen at high spiked concentrations. MiniKraken processed metagenome reads the
fastest, but it was also the least accurate on the real world HMP samples, with poor BLAST support for
the species it commonly detected and a large number of missed species that did have strong BLAST
support, and it classified an average of about a third of the reads. Clinical Pathoscope and GOTTCHA
also had poor accuracy; they were the slowest classifiers in our tests, and they failed to classify even
larger percentages of reads. MetaPhlan2 was faster than LMAT-ML, although it only classified an average
of 5% of reads and missed many species that were clearly present based on BLAST results. LMAT-ML ran on
average at about half the speed of MetaPhlan2 and required 24 GB of DRAM to avoid any performance
penalty for paging the database index, more than other marker library methods. However, LMAT-ML+H
classified over 60% of reads and showed far better accuracy than other marker library methods as
verified by BLAST, delivering results nearly as complete as those of LMAT-Grand with only a fraction of
the memory requirements, at a speed capable of analyzing a gigabase-sized sample in about 3.5 minutes
with 24 CPUs and 24 GB of DRAM. Additionally, running LMAT-ML+H with 16 GB rather than 24 GB of
DRAM is roughly 50% slower.
The processing rate of LMAT-ML databases is affected by two key factors: the rate at which the index
can be accessed, and the classification rate of an individual read, which is impacted by the number of
taxonomy identifiers retrieved for each constituent k-mer in the read. Fast access to the index is
impacted by the size of the database, with larger databases requiring additional paging.
We consider the use of the SATA II SSD as it provides a relatively low cost alternative to DRAM for
storage. Current advertised prices are about $11 per gigabyte for DRAM, $0.75 per gigabyte for SATA SSD,
and around $5 per gigabyte for mid tier PCIe flash. We present the use of an SSD with LMAT and limited
memory in contrast to previous experiments with PCIe flash with larger LMAT database indexes [19].
Additional experimentation with the larger Grand database and SATA SSD with limited main memory has
demonstrated subpar performance. For the smaller ML databases, however, the performance reduction
with SATA SSD was minor, since demands on main memory were much lower. Although LMAT runs
faster in DRAM only, we estimate that a laptop/desktop with 16 GB RAM and a 24 GB flash drive will
perform rapid and accurate metagenome analyses with LMAT-ML+H.
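The cost argument can be made concrete with the per-gigabyte prices quoted above. The sketch below is simple arithmetic on those advertised prices; the function name is hypothetical, and the two configurations are the 24 GB DRAM and the 16 GB DRAM plus 24 GB SSD setups discussed in the text.

```python
# Advertised prices quoted in the text (USD per gigabyte).
DRAM_USD_PER_GB = 11.00
SATA_SSD_USD_PER_GB = 0.75

def memory_cost(dram_gb, ssd_gb=0):
    """Approximate hardware cost of a DRAM plus optional SATA SSD configuration."""
    return dram_gb * DRAM_USD_PER_GB + ssd_gb * SATA_SSD_USD_PER_GB

dram_only = memory_cost(24)          # 24 GB DRAM
dram_plus_ssd = memory_cost(16, 24)  # 16 GB DRAM + 24 GB SATA SSD
print(f"24 GB DRAM:             ${dram_only:.2f}")
print(f"16 GB DRAM + 24 GB SSD: ${dram_plus_ssd:.2f}")
```

At these prices the mixed configuration costs about $194 against $264 for DRAM alone, which is the trade-off weighed against the slower run time.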
While all methods aim to classify reads at the most specific level possible, that level of specificity must
be supported by the data. All of the methods other than LMAT failed to identify genus, family, phylum,
or even higher levels of conservation in the HMP reads, and thus reported overly specific calls.
Databases like RefSeq have only one representative sequence for many species and some genera, and
documentation explicitly states that more than one strain will be included only in exceptional
circumstances as determined manually by NCBI staff [20]. This renders suspect any strain calls made by
classifiers relying on RefSeq, since there are not enough near neighbors to resolve at this level. In
addition, MiniKraken, Clinical Pathoscope, and GOTTCHA made errors by misclassifying reads as
microbial that were much more likely to be artificial sequences from sample preparation. Metaphlan2
did not misclassify artificial sequences, but it did misclassify human reads as viral. We and others [9,21]
have found substantial contamination of draft genomes with adaptor, vector, and other synthetic
sequences and human or other host sequences. LMAT-Grand and LMAT-ML specifically label synthetic
and human k-mers and then apply a greedy strategy to detect reads with these sequences. This allows
the LMAT database to contain large numbers of draft sequences to span novel strain and species
diversity without misclassifying human and synthetic reads as microbial due to contaminated draft
assemblies.
MiniKraken, Clinical Pathoscope, and GOTTCHA reference sequences consist of NCBI RefSeq complete
bacterial, archaeal, and viral genomes and the human reference genome, and Metaphlan2 and SIANN use
even smaller subsets of these sequences. With LMAT-Grand and LMAT-ML, we extend the reference
database to span every microbial genome in the public domain, and more, so that
LMAT is the only metagenome classification software that includes 1) eukaryotic sequence in both the
Grand and the LMAT-ML databases, enabling the classification of fungi, protozoa, and some multicellular
organisms (from organelles labeled as whole genome, e.g. mitochondria and chloroplasts); 2) draft
genomes and assembled contigs not contained in NCBI RefSeq; and 3) draft and finished bacteria, virus,
archaea, fungi, and protozoa genomes from a number of sequencing centers worldwide with publicly
available sequence data in addition to those from NCBI. While sequences available from these sites
eventually appear in NCBI databases, they may be publicly available years before release at NCBI, and
many strains may never become a part of NCBI RefSeq.
Since LMAT includes all available strains (genomes) that have been sequenced, it should provide more
accurate species resolution and reporting of the closest strains in the database than other methods.
LMAT reports the number of reads matching multiple strains, and for those reads conserved across
multiple strains it reports only the species level match, since NCBI taxonomy nodes in general do
not exist for clades. Even for isolates with sequenced genomes, there may be more than one best strain
match for different subsets of reads due to evolution from the original sequenced isolate during
propagation, lateral gene transfer, recombination, and sequencing error. A phylogenetic approach is
probably necessary to accurately place new sequence in the broader context of other isolates, possibly
using assembly, alignment or SNPs, provided there is sufficient genome coverage. Other methods that
fail to include many species and strains in their reference database cannot resolve specific strains.
Results presented here show that these methods are incorrect or overly specific even in their species
and genus level classifications.
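Reporting a species level match for a read shared by several strains amounts to finding the most specific taxonomy node that covers all candidate matches. The sketch below illustrates that idea with a lowest common ancestor over a toy parent map; the taxonomy, the node names, and the function names are invented for illustration and are not LMAT's data structures or code.

```python
def lineage(taxon, parent):
    """Path from a taxon up to the root, following child -> parent links."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path

def lowest_common_ancestor(taxa, parent):
    """Most specific node shared by the lineages of all given taxa."""
    if not taxa:
        return None
    paths = [lineage(t, parent) for t in taxa]
    common = set(paths[0]).intersection(*paths[1:])
    for node in paths[0]:  # walk upward from the most specific node
        if node in common:
            return node
    return None

# Toy taxonomy (hypothetical names): two strains of one species, plus a sister species.
parent = {
    "strain_A": "species_X",
    "strain_B": "species_X",
    "species_X": "genus_G",
    "species_Y": "genus_G",
    "genus_G": "root",
}
print(lowest_common_ancestor(["strain_A", "strain_B"], parent))   # species level
print(lowest_common_ancestor(["strain_A", "species_Y"], parent))  # genus level
```

A read matching only strain_A would stay at strain level in this sketch, while a read conserved across both species would be reported at the genus.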
Not only is LMAT the only method that can detect eukaryotic sequences, it also reports calls to plasmids
versus chromosomes for distinguishing the presence of these mobile genetic elements. Metaphlan2,
GOTTCHA and MiniKraken do not detect plasmids, and make only taxonomic calls. Clinical Pathoscope
does identify reads by database entry, so it is possible to distinguish plasmid from chromosomal
matches. LMAT distinguishes plasmid calls, creates a file exclusively listing the plasmids detected,
and also includes those plasmid calls in the overall results summary with all taxonomic calls. For
methods other than LMAT, there is no process described in the manuals for extracting reads responsible
for a given call, making it difficult to verify those calls and to do additional analyses such as
assembly, per species gene annotation, SNP analyses, or distinguishing matches to plasmid versus
chromosome. In addition, the failure of some of the methods to report standardized NCBI taxonomy
identifiers for all calls, plus the use of nonstandard or outdated species names and GI numbers not in
the current NCBI database, makes the process especially challenging. We encourage software developers
to describe procedures for extracting reads for the taxonomic calls made by the method, to facilitate
call verification from the reads responsible for each call.
Alignment based methods (e.g. BLAST and read mapping) scale linearly with the number of bases in the
reference database. To scale with an ever growing pool of reference genomes, alignment-based
software must reduce to only a subset of the available data by excluding strain variants, draft genomes,
and non-microbial kingdoms. As a consequence, these methods fail to classify large numbers of reads,
report overly specific classifications for sequences which in fact are more widely shared across taxa, and
either misclassify or fail to detect all the species and genera missing from the database. In addition,
alignments require a cap on the maximum number of alignments to return to retain reasonable run
times. We have observed that for highly conserved sequences (like 16S rRNA or housekeeping genes)
where there may be thousands of additional unreported matches over that maximum, the sort order of
reporting matches can result in biases and overly specific calls for taxa that may not actually be present.
In contrast, the k-mer based approach has the advantage of retaining and condensing conserved
subsequences, so that adding related reference genomes increases the database size only for novel
k-mers and the small increment of adding that genome tag to existing k-mers already stored. Thus, the
database size grows as a function of sequence diversity, not as a strictly linear increase with the number
of bases in the reference database. We have plans to add more eukaryotic genomes to the LMAT
database (e.g. mosquitos, nematodes, ticks, plants), to classify more reads from environmental samples,
which should be tremendously helpful in fields such as bioenergy, microbial ecology, industrial
metagenomics, and environmental biosurveillance [4]. Many eukaryotes have extremely large and
repetitive genomes [22], so a k-mer approach that scales with diversity rather than with genome size
makes including more eukaryotes practical, which should further reduce the number of unclassified
reads and improve our ability to separate reads from truly novel, unknown microbes for further analysis.
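The scaling argument for k-mer databases can be illustrated with a toy index that maps each k-mer to the set of genomes containing it. This is a minimal sketch under stated assumptions: the index layout, the function names, the tiny sequences, and k = 4 are all chosen for illustration and do not reflect LMAT's actual index format or k-mer length.

```python
def kmers(seq, k):
    """Set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def add_genome(index, tag, seq, k):
    """Add a genome to a k-mer -> {genome tags} index.

    Returns the number of novel k-mers, i.e. how many new keys the index gained.
    Shared k-mers only gain one extra tag on an existing entry.
    """
    novel = 0
    for km in kmers(seq, k):
        if km not in index:
            index[km] = set()
            novel += 1
        index[km].add(tag)
    return novel

index = {}
n1 = add_genome(index, "genome_1", "ACGTACGTAC", 4)  # first genome: every k-mer is novel
n2 = add_genome(index, "genome_2", "ACGTACGTAC", 4)  # identical strain: nothing novel
n3 = add_genome(index, "genome_3", "ACGTACGTAA", 4)  # one substitution at the end
print(n1, n2, n3, len(index))
```

The number of keys grows only with new sequence content, so closely related genomes cost little more than an extra tag on k-mers that are already stored.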
Comparing results from actual HMP samples across 6 metagenome analysis software packages, we
found that the LMAT Marker Library “LMAT-ML+H” classified microbial contents most accurately and
comprehensively due to its reliance on a reference database 1-2 orders of magnitude larger than that of
other software and representing 2-4 times more species. Its speed is competitive with other tools, and
although memory demands are higher, they are still well within the price range of a standard desktop
machine with 24 GB of memory or 16 GB memory with a low cost SSD drive.
Acknowledgements
We thank Marisa Torres and Clinton Torres for building the infrastructure to download and update the
reference sequence database used by LMAT. This work was performed under the auspices of the US
Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Laboratory Directed Research and Development (33-ER-2012 and 08-ER-2011);
Conflict of Interest: none declared.
References
1. Miller R, Montoya V, Gardy J, Patrick D, Tang P: Metagenomics for pathogen detection in public
health. Genome Medicine 2013, 5(9):81.
2. Padmanabhan R, Mishra AK, Raoult D, Fournier P-E: Genomics and metagenomics in medical
microbiology. Journal of Microbiological Methods 2013, 95(3):415-424.
3. Di Bella JM, Bao Y, Gloor GB, Burton JP, Reid G: High throughput sequencing methods and
analysis for microbiome research. Journal of Microbiological Methods 2013, 95(3):401-414.
4. Neelakanta G, Sultana H: The Use of Metagenomic Approaches to Analyze Changes in
Microbial Communities. Microbiology Insights 2013, 6:37-48.
5. Naccache SN, Federman S, Veeraraghavan N, Zaharia M, Lee D, Samayoa E, Bouquet J,
Greninger AL, Luk K-C, Enge B et al: A cloud-compatible bioinformatics pipeline for ultrarapid
pathogen identification from next-generation sequencing of clinical samples. Genome
Research 2014, 24:1180-1192.
6. Desai N, Antonopoulos D, Gilbert JA, Glass EM, Meyer F: From genomics to metagenomics.
Current Opinion in Biotechnology 2012, 23(1):72-76.
7. Wilkening J, Wilke A, Desai N, Meyer F: Using Clouds for Metagenomics: A Case Study. 2009
IEEE International Conference on Cluster Computing and Workshops 2009:80-85.
8. Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C: Metagenomic
microbial community profiling using unique clade-specific marker genes. Nat Meth 2012,
9(8):811-814.
9. Wood D, Salzberg S: Kraken: ultrafast metagenomic sequence classification using exact
alignments. Genome biology 2014, 15(3):R46.
10. Byrd A, Perez-Rogers J, Manimaran S, Castro-Nallar E, Toma I, McCaffrey T, Siegel M, Benson G,
Crandall K, Johnson W: Clinical PathoScope: rapid alignment and filtration for accurate
pathogen identification in clinical samples using unassembled sequencing data. BMC
Bioinformatics 2014, 15(1):262.
11. Allen T, Freitas K, Li P-E, Scholz MB, Chain PSG: Accurate read-based metagenome
characterization using a hierarchical suite of unique signatures. Nucleic Acids Research 2015,
43(10):e69.
12. Minot S, Turner SD, Ternus KL, Kadavy DR: SIANN: Strain Identification by Alignment to Near
Neighbors; 2014.
13. Ames SK, Gardner SN, Slezak TR, Gokhale MB, Allen JE: Using populations of human and
microbial genomes for organism detection in metagenomes. Genome Research 2015,
25:1056-1067.
14. Ames SK, Hysom DA, Gardner SN, Lloyd GS, Gokhale MB, Allen JE: Scalable metagenomic
taxonomy classification using a reference genome database. Bioinformatics 2013,
29(18):2253-2260.
15. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal
of molecular biology 1990, 215(3):403-410.
16. Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase Update, a
database of eukaryotic repetitive elements. Cytogenetic and genome research 2005,
110(1-4):462-467.
17. Allen JE, Gardner SN, Slezak TR: DNA signatures for detecting genetic engineering in bacteria.
Genome biology 2008, 9(3):R56.
18. Frey KG, Herrera-Galeano JE, Redden CL, Thissen JL, Dyer M, Allred A, Mokashi V, Gardner SN,
Slezak T: Towards Quantitative Metagenomics: Targeted Enrichment for Detection of
BioThreat Agents. In: Ion World: October 21-22 2013; Boston, MA.
19. Ames S, Allen JE, Hysom DA, Lloyd GS, Gokhale MB: Design and Optimization of a
Metagenomics Analysis Workflow for NVRAM. In: 13th IEEE International Workshop on High
Performance Computational Biology. May 2014.
20. Pruitt K, Brown G, Tatusova T, Maglott D: The Reference Sequence (RefSeq) Database. In: The
NCBI Handbook [Internet]. Edited by McEntyre J, Ostell J: National Center for Biotechnology
Information (US); 2002.
21. Merchant S, Wood DE, Salzberg SL: Unexpected cross-species contamination in genome
sequencing projects. PeerJ 2014, 2:e675.
22. Nene V: Tick genomics--coming of age. Frontiers in bioscience (Landmark edition) 2009,
14:2666-2673.