Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. Lowest expressing microRNAs capture indispensable information: identifyingcancertypes RoniRasnic1,NathanLinial1*andMichalLinial2* 1 SchoolofComputerScienceandEngineering,TheHebrewUniversityofJerusalem,Israel 2 Deptartment of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem,Israel * Correspondingauthors MichalLinial DeptofBiologicalChemistry(ML),TheHebrewUniversity,EdmondJ.Safra,Jerusalem,91904 [email protected]: 972-2-6584884FAX: 972-2-6586448 Figures:7/Table:1/SupplementalTables:3 Runningtitle:LowestexpressingmicroRNAsincancertyping bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. ABSTRACT The primary function of microRNAs (miRNAs) is to maintain cell homeostasis. In cancerous tissues miRNAs’expressionundergodrasticalterations.Inthisstudy,weusedmiRNAexpressionprofilesfrom The Cancer Genome Atlas (TCGA) of 24 cancer types and 3 healthy tissues, collected from >8500 samples.Weseektoclassifythecancer’soriginandtissueidentificationusingtheexpressionfrom1046 reportedmiRNAs.DespiteanapparentuniformappearanceofmiRNAsamongcanceroussamples,we recover indispensable information from lowly expressed miRNAs regarding the cancer/tissue types. Multiclass support vector machine classification yields an average recall of 58% in identifying the correct tissue and tumor types. Data discretization has led to substantial improvement reaching an average recall of 91% (95% median). We propose a straightforward protocol as a crucial step in classifying tumors of unknown primary origin. Our counter-intuitive conclusion is that in almost all cancer types, highly expressing miRNAs mask the significant signal that lower expressed miRNAs provide. Keywords: TCGA,non-codingRNA,Cancerofunknownprimaryorigin,Machinelearning,Cancergenomics 2 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. INTRODUCTION MaturemicroRNAs(miRNAs)areshort,non-codingRNAmolecules.Theyleadtoposttranscriptional repression by reducing the stability of mRNAs and attenuating the translation machinery (1). In humans,thereareabout1900miRNAgenesthatyield~2600maturemiRNAs.Theexpressionlevelsof miRNAsinhealthytissuesspan5to6ordersofmagnitude(2). In all multicellular organisms including humans, miRNAs were implicated in embryogenesis, tissue identityanddevelopment.Asthegatekeepersofcellhomeostasis(3),miRNAsrespondtoalterationin cell regulation that occurs during viral infection, inflammation, and numerous pathologies. In mammals, cell transformation and carcinogenesis are accompanied by drastic alterations in miRNA expression profiles (4,5). Those miRNAs that are associated with carcinogenesis and metastasis are called oncomiRs (6). They were implicated in targeting components of the cell cycle, DNA repair, oncogenesortumorsuppressorgenes(7,8).MostoncomiRsareexpressedinmultiplecancertissues. However,somearespecifictoonlycertaincancertissues.Forexample,humanmiR-21over-expression isassociatedwithalmostallcancerswhilemiR-15andmiR-16expressionsaremostlyassociatedwith B-cellneoplasm.OthermiRNAs(miR-143andmiR-145)directlyregulateotheroncomiRsandthusare candidatesforanti-tumortherapy(9).Therefore,alteredmiRNAexpressionincancerousversushealthy tissuesissuggestedasinvaluablebiomarkers,forcancerdiagnosisandprognosisandasaleadfornovel therapeuticapproach(10).DespiteadvancesinunderstandingcellderegulationbymiRNAsincancers formostmiRNAs,arelationbetweenexpressionlevel,diagnosisandprognosisforspecificcancertypes cannotbedrawn(seediscussionin(11,12)). From a clinical perspective, profiling miRNAs is important for: (i) Selecting optimal treatment; (ii) Monitoringthedisease’sprogression;(iii)Identifyingtheprimaryoriginofametastaticcancer(12).For example, miR-10b level was shown to be a prognosis indicator for chemotherapeutic resistance in colorectalcancer(13)whilemiR-30cinhibitstumorchemotherapyresistanceinbreastcancer(14).The 3 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. monitoring of minute miRNAs levels in patient’s body fluids allows a follow up for treatment and a directassessmentfordisease’sprogression(5,15). Manyofthesamplescollectedfromcancerpatientsaremetastatic.Evidently,itismorechallengingto treatmetastaticthanprimarycancers(seereviews(16,17)).Importantly,theoriginofupto12-15%of allnewlydiagnosedcancersisunknownoruncertain(18).Duetothedifficultyofidentifyingtheorigin of metastatic tissues, cancers of unknown primary origin (coined CUP) are associated with poor prognosisandinconclusiveprotocolfordiseasemanagement(19).Consequently,thereisanactiveline ofresearchseekingwaystoidentifythetypeandoriginoftissuesfrommetastaticbiopsiesthrough their molecular signatures (for example (12,20,21)). Success in classifying lung squamous cell carcinomasfromlungadenocarcinoma(22)wasfurtherexpandedtofourmainsubtypesoflungcancer usingmiRNAprofiles(23).Inanotherstudy,aclassifierbasedon47miRNAssuccessfullyseparated10 differenttissuesfrom~100samplesfromprimaryandmetastaticsamples(24).Expandingthetaskto 42cancertypesresultedinanextendedsetof68miRNAs(25)thattogetherareusedasindicatorsfor tissue of origin in numerous biopsies from cancer patients (26). Although most measurements of miRNAsoriginatefromcancerbiopsies(27),miRNAscirculatinginbodyfluidshasbecomeattractive candidatesforearlydiagnosis(28,29). ThegoalofourstudyistousethemiRNAexpressionprofiledatafromTheCancerGenomeAtlas(TCGA) toward the task of classifying different cancerous tissues and their disease/healthy states. TCGA providesrichmoleculardatafromcancerpatientsonanunprecedentedscale(30),withsamplesof>25 cancertypes.WefocusedonmiRNAprofilesthatreportonthenormalizedexpressionlevelofeachof the1046analyzedmiRNAs.Specifically,weask:(i)WhichmiRNAsbestcapturetheinformationneeded to distinguish the various cancerous types and origin? (ii) Which tissues are prone to false classifications?(iii)CanwefindabiologicalinterpretationofthesetofinformativemiRNAs? MATERIALSANDMETHODS 4 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. Datacollection MicroRNAs(miRNAs)profileswereextractedfromTheCancerGenomeAtlas(TCGA,collectedbetween October2013andJanuary2014)(31-34).Wedefinedaclassbycouplingtheoriginatingtissuewiththe labelofthesample’sstate(i.e.,cancerousorhealthy).Welimitedouranalysistoclasseswithover50 samples.Intotalwecollected27classes.ThreeclassesofhealthytissuesareabbreviatedasKidney, ThyroidandBreast.Additionally,24cancerousclassesandtheiracronymsarelisted:adrenocortical carcinoma(ACC),bladderurothelialcarcinoma(BLCA),brainlowergradeglioma(LGG),breastinvasive carcinoma(BRCA),cervicalsquamouscellcarcinomaandendocervicaladenocarcinoma(CESC),colon adenocarcinoma (COAD), esophageal carcinoma (ESCA), head and neck squamous cell carcinoma (HNSC),kidneychromophobe(KICH),kidneyrenalclearcellcarcinoma(KIRC),kidneyrenalpapillary cell carcinoma (KIRP), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), pancreatic adenocarcinoma (PAAD), pheochromocytoma and paraganglioma (PCPG), prostate adenocarcinoma (PRAD), rectum adenocarcinoma (READ), sarcoma (SARC), skin cutaneous melanoma (SKCM), stomach adenocarcinoma (STAD), thyroid carcinoma (THCA),uterinecarcinosarcoma(UCS)anduterinecorpusendometrialcarcinoma(UCEC). The27classesinclude8522distinctsamples.Eachsampleisassociatedwiththeexpressionof1046 miRNAs. We assessed the similarity among patient samples by calculating the cosine of the angle betweenthevectorsofmiRNAexpression.Thesecosineswereusedasaproxytotheinitialseparation abilityandsimilarityamongclasses.ThelistsofallsamplesandmiRNAs’TCGAidentifiersareprovided insupplementarydataTableS1. Datatransformation Weapplyaroughdiscretizationtothedata,accordingtomiRNAexpressionlevel,measuredinreads permillion(RPM).Westartwithadichotomy-onecategoryforexpressionlevelsstrictlybetween0 and30RPM,andasecondcategoryforeverythingelse.Asomewhatmorerefinedclassificationisa trichotomy,thatpartitionsthedatatonon-expressed,lowlyexpressed(expressionlevelsfrom0to30 5 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. RPM)andhighlyexpressed(>30RPM).Onaverage,eachsamplehas306lowlyexpressedmiRNAs(s.d. of43.91),and163highlyexpressedmiRNAs(s.d.of18.58). Machinelearningmodel HumanTCGAsampleswereclassifiedtothe27classes(seeabove)bytrainingamulticlasssupport vector machine (SVM) classifier (35). We used a Python implementation based on the scikit-learn package(36).Thetrainingsetsfortherawdata,thedichotomyandthetrichotomywerelimitedto40% ofthesamplesofeachclass.ThesensitivityoftheSVMclassifiertothesizeofthetrainingsetwas tested.Thetestswererepeatedwithtrainingsetcomprisedof20%to80%ofeachclassofsamples, with increments of 10%. Our results for both data transformations were consistent and stable for trainingsetsofatleast40%ofthesamplesandsomewhatlessstablefortrainingsetsbelowthe40% threshold. Both dichotomy and trichotomy schemes were run 1000-10,000 times each. Scikit-learn cross-validationmechanismwasusedinordertopreventover-fitting. Priortodataanalysiswerandomlypartitionalldatainto40%thatareunseenand60%thatareseen. Wecreate5trainingsetseachofwhichiscomprisedofarandomsubsetoftwo-thirdsofseenpart.For eachset,twoclassificationmatriceswerecreated,onebasedonthe30RPMthreshold,andtheother basedona60RPMthreshold,forseparatingthelowly-expressedfromthehighly-expressed.Wethen runourmachine-learningprocedureoneachofthese5setsandtestthemontheunseenpartofthe data.Ourfinalclassificationischosenbycarryingapluralityvoteamongthese10classificationresults. Performanceevaluationanderroranalysis Weassessedourmodelsaccordingtotheiraveragerecallrateforeachoftheclasses.Recallisdefined as[truepositive]/[truepositive+falsenegative].Theaverage,medianandstandarddeviation(s.d.) weremeasuredforover300independentrunsoftheSVMprotocolonrawordiscretizeddata. Wecarriedouttwotypesoferroranalysis:Atthesamplelevelandattheclasslevel.First,weanalyzed the errors made per each of the samples over distinct runs (Sample-based Error). A second error 6 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. analysisconcernsaclassmisclassificationsandtheidentityofthefaultyassignment(Class-basedError). The plurality vote was implemented in our final classification in order to overcome Sample-based Errors.WehavemergedtheclassesREADandCOADinordertoreducetheextentof Class-based Errors. InformativemiRNAextraction Several methodologies were applied to identify and select a minimal set of miRNAs that are most informativetowardsasuccessfulclassificationtask.Ourbestresultswereobtainedusingtheabove described trichotomy. Specifically, for each miRNA we create an 81-dimensional vector, with 3 coordinatesforeachofthe27classes(24cancertypesand3healthytissues).Thesethreecoordinates aredefinedasfollows:IfthereareNsamplesinthisclassforthemiRNAathand,ofwhichN_0arenot expressed, N_1 are between 0 and 30 RPM and N_2 are highly expressed, then we set the first coordinateatN_0/N,thesecondtoN_1/NandthethirdtoN_2/N.Wethencalculatethestandard deviation(s.d.)ofeachoftheexpressionlevelsamongthe27classesseparately,andcalculatethesum ofthes.d.Formally: 𝑠𝑑(𝑁_0' , 𝑁_0) , … , 𝑁_0)+ ) + 𝑠𝑑(𝑁_1' , 𝑁_1) , … , 𝑁_1)+ ) + 𝑠𝑑(𝑁_2' , 𝑁_2) , … , 𝑁_2)+ ). WeexpectmiRNAsofhighersummeds.d.valuestobemoreinformative,sincethes.d.capturesthe miRNA’svariabilityamongdifferentclasses.Wechosenottoassigndistinctweightstodifferentclasses despitetheirdifferingsamples’size(rangingfrom57forUterineCarcinosarcomato1066forBreast InvasiveCarcinoma). RESULTS ManycanceroustissueshavesimilarmiRNAprofiles ThemaingoalofourstudyistoclassifycancerclassesusingtheinformationencodedbymiRNAprofiles frompatients.Formostcancertypes,asmallnumberofhighlyexpressedmiRNAsundergoadrastic 7 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. changeinexpressionlevelalongtheprogressionofthedisease.Therefore,itisacommonlyheldview thatuniquelyexpressedmiRNAsmightbeusedfordiagnosis,prognosisandpossiblyalsofordisease treatment.HerewerevisitthisviewandtestthemiRNAheterogeneityandexpressionconsistencyin an unbiased way. We analyze many different cancer types and tissues, and thousands of patient samples. WewantedtogetasenseofthesimilaritybetweenthevectorsofmiRNAreadsamongsamplesfrom eachgiventissue.WefoundhighcorrelationsamongthemiRNAexpressionvectorswhenbothsamples are healthy or when both are diseased (within the same tissue). The correlation is much lower for healthyvs.cancerousone(datanotshown).Figure1showsananalysisforallpairsofpatientswith lungadenocarcinomasamples(521samples,abbreviatedLUAD)andmatchedhealthysamples(~50 samples).Thesimilaritybetweeneachpairisindicated(cosinevalues,seeMaterialsandMethods).We showthathealthyandcanceroustissuesexhibitdifferentpatterns.Importantly,healthysamplesare similartoeachanotherasarethediseasedsamples. We tested whether this property (Figure 1) extends to all other cancerous and healthy tissues. We thereforerestrictedouranalysistotissuesforwhichsamplesfrombothhealthyandcancerouspatients areavailable.AsFigure2Ashows,eachhealthytissueiswellcharacterizedbyitsmiRNAs’profile.For example,breasttissuesindifferentpatientshaveverysimilarmiRNAprofiles,anddiffersignificantly from liver profiles. Even the two types of lung tissues are visibly distinguishable from each other. However,thedistancematrixforthediseasedsamples(5classes,~2700patients,Figure2B)isrelatively uniform.Whileallthediseasedtissuesarerathersimilarusingoursimilaritymatrix,asomewhathigher similarityisvisiblebetweenthetwolungdiseases,theadenocarcinoma(LUAD)andlungsquamouscell carcinoma(LUSC).WeconcludedthatthetrendthatweobservedinFigure1appliestoothertissues. Namely,thedistanceamongallpairsofdiseasedtissuesislowerthanthatamongthecorresponding healthytissues. Figure2Coffersabird'seyeviewofmiRNA-basedinformationforall8522samplesthatwereincluded inouranalysis,accordingto27classesdiscussedinthisstudy.Theentriesshowtheaverageofthe 8 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. correlations over all samples in each pair of classes. Note that the values on the diagonal are only slightlyhigherthantherestofthevaluesinthissymmetricmatrix.Itisnoteworthythatthatbrainlower grade glioma (LGG) markedly differs from other cancerous tissues, yet exhibiting a significant selfsimilarity.Additionally,sarcomasamples(SARC,diagonal,Figure2C)exhibitratherlow(0.61)average correlation,suggestinghighvariabilityofthemiRNAexpressioninthiscancertype. LowestexpressedmiRNAsprovidedifferentiatinginformation InviewoftherelativelyhomogenouspatternofmiRNAprofilesamongdiseasedtissues(Figure1),itis clearthattheidentificationofcancerclassescannotbetriviallybasedonapredefinedsimplesetof miRNAs.Thus,wesoughtaclassifierthatcancorrectlylabelatissueoforiginbasedoninformation thatisgloballycapturedbythemiRNAs’profiles.Tothisend,weappliedamulticlasssupportvector machine (SVM) classifier with a randomly chosen training sets (Figure 3A). The flow of the methodologies used in this study is shown. Our classifier employs two variations of data transformation,followingbyastepthatbenefitsfromstudyingthetypicalerrorsinourclassification task(Figure3A). SVMappliedtotherawdatatoidentifythetissueoforiginachievesanaveragesuccessrateof58%, andamedianof69%(27classes)(Figure3B,blue).Notethatnaïveguessingwouldyield~4%success rateforclassifyingasampletoanyofthe27classes.Interestingly,for10outofthe27classesattains recall rate higher than 80%, and for 5 classes the success rate is even above 90%. However, the predictionforsomeclassesisverypoor(<10%).ThebestperformanceisassociatedwithLGG,ascould beanticipatedfromtheobservationinFigure2C. However,forclinicalpurposes,asuccessrateof58%isunsatisfactory.Wethusfocusedonsearchinga subsetofmiRNAsthataremostinformativetowardsthetaskofclassifyingtissueoforiginandsample status(diseases,healthy).TheexpressionlevelsofmiRNAexhibitahighdynamicrangeof5-6orderof magnitudesandexpressionisexpressedinRPM(readspermillions).Furthermore,itturnsoutthat onlyasmallnumberofmiRNAs(20-25)accountforthevastmajorityofthereadsinthesamples(9698%, not shown). Based on these observation, we tested the potential of discretizing miRNAs’ raw 9 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. values,thusallowingarepresentationthatemphasizesthecontributionoflowlyexpressedmiRNAs. WeanticipatedthatthelongtailofmiRNAsmightbemoreinformativefortheclassificationpurposes withrespecttoasmallsetofdominatingmiRNAs. WeappliedasimpledichotomythatpartitionsthelowlyexpressedmiRNA(<30RPM)vs.therestof themiRNAs.Typically,~70%ofthemiRNAsareexpressedattherangeof>0to30RPM.Thissimple transformation (see Materials and Methods) dramatically reduces the volume of data, and concentratesononly~0.02%oftheoriginalsignal(i.e.,thesumtotalofmiRNAsexpressionvalues). Importantly,theapplicationofnaïvedichotomyalreadyimprovestheperformanceofourclassifierto anaveragesuccessrateof66%(medianof80%,Figure3B). Wesoughtadditionaldataprocessingprocedurestofurtherimprovetheclassificationperformance. IndichotomywetreatequallymiRNAsthatareexpressedatlevelsabovethethreshold(>30RPM)and thosethatarenotexpressedatallinaspecificclass.Wenextseparatethesetwosets,andreplacethe datatransformationtoatrichotomy(Figure3A,seeMaterialsandMethods).Thistransformationyields asubstantialimprovement.Theclassifierthusreachesanaveragesuccessrateof88%(median91%, Figure3BandTable1).Figure3Cshowstheimprovementintheclassificationsuccessforthetrichotomy mode with respect to the original raw data. It is clear that some classes benefit more from this transformationthanothers.For8outofthe27classestheobservedimprovementwas>3folds(Figure 3C). Weanalyzedthedependenceofthesuccessrateonthechosenthreshold(<30RPM)andobserved that the recall rate remained virtually unchanged for thresholds ranging between 5 to 300 RPM. However,raisingthethresholdabove300RPMaffectstheperformancenegatively,suggestingthatat thisrangeofparameterweareactuallyaddingnoisyinformation(Figure4).Itisevidentthatforboth thedichotomyandtrichotomymodes,informationwithintheverylowexpressionlevelsiscriticalfor theclassificationtask.Thecontributionoftheverylowly-expressedmiRNAstotheclassificationtask wasfurtherassessed.WezoomedonmiRNAsexpressedatlevelsx,wherexrangesfrom>0to30.We trainedtheclassifieragainfordifferentvaluesofxandartificiallyrelabeledmiRNAsthataresmaller 10 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. than x as non-expressed. When marking as non-expressed miRNAs with expression levels under 10 RPM,therewasnosignificantchangetotheclassificationresults,whileapplyingx>10RPMledtoa declineinperformance.Typically,only~8%ofthemiRNAsarelistedintherangeof10-30RPM,while 25%oftheentirelistofmiRNAsarewithintherangeof1to10RPM.Weconcludethatlowlyexpressed miRNAsthatcarryfundamentalinformationfortheclassifier,mostly(butnotentirely)resideinthe10 to30RPMrange. Notalltissuetypesareequallyhardtoclassify Asmentioned,trichotomyhasresultedinasubstantialimprovementformanyclassifiedtissues.For11 outofthe27classes,weconsistentlyreachedasuccessrateof95%orabove(Table1).Importantly, theclassificationsuccessdoesnotimmediatelyreflectthesamplesize.Forexample,a96%successfor Adrenocortical Carcinoma (ACC) is based on only 80 samples while lung squamous cell carcinoma (LUSC)achievedonly84%successwithasamplesizewhichis6-foldlarger(Table1).Specifically,weare doingpoorlyonclassifyingesophagealcarcinoma(ESCA)andrectumadenocarcinoma(READ),witha successrateof43%and39%,respectively. Wetestedthesensitivityoftheclassificationperformancewithrespecttoalternativemachinelearning methods:RandomForestandSVMwithpolynomialandradialbasisfunction(RBF)kernels.Theresults forRandomForestandSVMwithpolynomialkernelweresomewhatworsethantheusedSVM(see Materials and Methods). RBF kernel gave an improved performance on the raw data, but failed to improvefollowingdatadiscretizationanddatatransformation. Athirdoftheclassificationerrorsareinterpretable Wetriedtorefineourunderstandingonfeaturesthatgovernthesuccess/failureintheclassification task.Tothisendweinspectedtheresultsinviewofthedifferenterrorsthataremade.Figure5presents a“wheelview”forall27classeswithavisualindicationonthesamples’sizeforeachclass,andthe typical errors. Cross-edges in the graph and their color indicate the extent and nature of all misclassifications.Thewheelindicatesthetrue-positiveandfalse-positive(innerarc),aswellasthe 11 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. true-positivesandfalse-negatives(middlearc).Fromtheoutermostarconecanestimatethesumof thetwoerrortypes.ThesourcedataisavailableinSupplementalTableS2. Weareoftenunabletointerprettheerrorsinclassification.Wecategorizedtheerrorsbybiological relevanceoftheclassificationerrors:(a)Cancerousstateerrors:Correctclassifiedtissuebutfailurein determining healthy vs. diseased state. These are restricted to the 3 instances of tissues that are representedbybothhealthyandcancerousinstances.(b)Intra-tissueerrors:misclassificationofone classtoanotherthatisanatomicallyadjacent.Sucherrorsarelimitedtoclasseswithmultiplediseases of the same/ related organ. (c) Inter-class errors: classification mistakes that are not obviously interpretable. Figure 6A shows schematically the possible error types for all 27 classes. Figure 6B indicatesthedistributionofdifferenterrortypesamongtherelevantcategoriesoferrors.Figure6C focusesonanexampleforinter-classerrorandthesourceforfalsepositives.Itshowsthatinthecase ofBLCA,themajorityofthefalsepositives(10.6%outof12.7%)derivefrommisclassificationfrom6 classesofcancercarcinomas(additionalclasses,thatcontribute<1%offalsepositiveseacharenot shown). Improvingclassificationsuccessfromerrors’consistency AsFigures5-6demonstrate,manyerrorsarerecurrentamongspecificclasses.Wetestedtheeffectof mergingclassesthatarecharacterizedbyabundant,bi-directionalfalseassignmentsbytheclassifier. The most prominent case involves Adenocarcinoma of the Rectum and Colon (READ and COAD, respectively). Merging these two classes has improved the average recall rate, while merging other pairsofclasseshasworsenedtheoverallclassificationsuccess(notshown). Whenweexaminedtheerrorspersampleittranspiredthatforcertainsamples,classificationvaried withthe(random)choiceoftrainingsets.Apluralityvoteprotocolwasappliedinordertoreducethis dependency(seeMaterialsandMethods).Thepluralityvoteledtoanadditionalimprovementinthe averageclassificationsuccess.Formostclassestheimprovementisrathermodest.However,merging theRectum(READ)andColonAdenocarcinoma(COAD)classesthatwerereportedwith39%and80% 12 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. success, respectively was highly beneficial. Actually the merged collection of samples reached 97% recall.Theaveragerecallrateforthevalid26classeshasreachedafurtherimprovementto91%,with amedianof95%(Table1). Additionaltestsinwhichtheclassifierwastrainedtoidentifytheclinicalstagesofthedisease(i.e., cancer stage 1 versus stage 3) and other global features, like age and gender, resulted in poor performanceandwillnotbefurtherdiscussed. LowestexpressingmiRNAunveilinformativemiRNAs Notwithstanding the impressive success rate of the classifier (91%, with 95% median), we are still unabletoquantifythecontributionofanyspecificmiRNAtothissuccess.Tothisend,wedesigneda ranking score for miRNAs according to their potential in contributing to the classifier success (see Materials and Methods). Empirically, we compared the results achieved when classifying with a reducedvectorofmiRNAexpression.Werandomlydrewvariousgroupsof50miRNAsoutofthe1046 miRNAdiscussedinthisstudy(TableS1)andtestedtheclassificationresultswhenusingthedataasis (rawdata),aswellasfollowingthetrichotomytransformation(Figure7A).Wealsotestedtheresults whenselectingthe50mostinformativemiRNA(accordingtorankbysumofthestatisticalvariation, seeMaterialsandMethods).WerepeatedthisrandomizationtestforagrowingsetofmiRNAs.Note that as the set size grows, a random set tends to include more informative miRNAs (e.g., for 200 miRNAs).Consequently,thedifferencesbetweentheperformanceoftherandomizedsetandthemost informativesetisreduced.Alreadythe50mostinformativemiRNAsattainanimpressive75%success rate,suggestingthatmuchoftheclassificationsignalisalreadycapturedbythesetthatischaracterized bymaximalvariance(Figure7A).With300ofthemostinformativemiRNAs,successratereachedon average87%,similartowhatisachievedwhenallthemiRNAsareemployedalongwiththetrichotomy transformation. WetestedwhetherthemiRNAsthatcontributemosttotheclassificationtaskareindeedthosethat wereimplicatedincancerprogression,prognosis,anddiagnosis.Tothisend,weanalyzedthe20most 13 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. informativemiRNAsbythecriterionusedinFigure7A.Thesourceofvariabilityisshownasreflected bythes.d.associatedwiththe3coordinatesforeachclass(noexpression,lowlyexpressed(0<x<30) andhighlyexpressed(>30RPM).ForacompleterankofinformativemiRNAsseeSupplementalTable S3.ItisclearthatsomeinformativemiRNAsrelyonvariabilityinthevectorofhighlyexpressed(Figure 7B,comparemiR-205tomiR934).Forothers,novariabilityisobservedinthevectorfornon-expressed (Figure7B,blue).Whilebothhsa-mir-141andhsa-mir-215areamongthemostinformativemiRNAs, their source of variability is rather different. While hsa-miR-205 is characterized by a very large variationtointhehighlyexpressedsection,yetasubstantialvariabilityisrecordedforthismiRNAalso intheothertwocategories(non-expressedandlowlyexpression). Discussion CellularquantitiesofmiRNAs OurresultsuncoverthepoweroftheleastexpressedmiRNAsindistinguishingalargecollectionof tissuesandcanceroustypes.ThecommonnotionisthatthequantityofmaturemiRNAsinthecellis criticalforcellregulation.Specifically,excessofaspecifictypeofmiRNApotentiallytitratesoutgenuine mRNAtargets(37).Asasecondaryeffect,anoverflowofspecificmiRNAwillmostlikelyresultinits bindingtonon-genuinetargets,theso-calledcalled“offtarget”effect(38).Theoverallquantitative effectofmiRNAlevelsisdiscussedintermsofcompetition(39,40)andcooperativity(41).Moststudies thatconsidermiRNAsinlivingcellstendtobebiasedtowardtheanalysisofhighlyexpressedmiRNAs. Withinsuchframework,lowlyexpressedmiRNAsaremostlikelydiscardedwiththepremisethatthey havenoimpactofthemasseffect. In this study, we show for the first time that for the task of cancer classification, miRNA-based informationisencodedbytheminuteexpressionofmiRNAsthattogetheraccountforonly0.02%of thetotalreads.Moreover,itisapparentlythelongdistributionaltailofmiRNAexpressionthatcarries thesignalforcorrectlyidentifyingmultiplecancer/tissuetypes.Weactuallyshowthatincludingthe 14 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. entire profile of miRNAs leads to poorer performance as the principal, informative signal for class identificationismasked. A practical consequence of our observation concerns the preferred sequencing depth needed for a routine miRNA profiling. The high coverage that is reported for all samples in TCGA, entails a high sensitivity.Specifically,about45-50%ofthemiRNAs,inalmostallclassesareexpressedatalevelof0 to1RPM.ThemiRNAlistwithvaluesof1-10RPMaccountsforanadditional25%ofthemiRNAlist. Classically,whensequencingshortRNA(<200nt),mostprotocolscallforacoveragerangingfrom0.5 to1.5Mreads,undertheassumptionthatdiscoveryratefornewmiRNAsisextremelylowabove2M reads(42).WearguethatthehighcoverageprovidedbytheTCGAisessentialfor(i)increasedreliability ofmiRNAidentification(i.e.,reportingonexpressionfromalistof1046miRNAs);(ii)highlightingnonclassicalandoftenlowexpressingmiRNAs. LowexpressingmiRNAactasintrinsicmarkersfortissuespecificity Anunexpectedoutcomeofourstudyreferstothestabilityandconsistencyintheexpressionofthe lowlyexpressedmiRNAsthroughmanysamplesofthesamecancerclass.Ampleobservationsshow that extreme conditions (e.g., stress, transformation, viral infection, differentiation) lead to drastic changes in miRNA expression profiles in cell types. We postulate that in conditions where miRNA expressionpatternsdramaticallychange(e.g.,cancer,stemcelldifferentiation,hypoxiaetc.),thetailof the miRNA distribution is hardly affected and consequently cell identity and the origin of the cells remainrobust.Thisinterpretationissupportedbythegradualmanipulatingofthedatabyremovalof miRNAsthatbelongtothelowlyexpressedtailandassessingtherobustnessoftheclassifiertosuch manipulations(Figure4). miRNAprofilecanbeusedforsub-classificationofcancertypes TCGAisafastgrowingresourceandhasalreadyreached>10,000sampleswhichareassignedwith33 cancertypes.Themethodologypresentedinthisstudyisapplicableforaclassificationtaskforany numberofdiseases’classesthataresupportedbyenoughinstances.Weshowedthattheclassification 15 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. performanceisveryhighforasmallsetofclasses(Figure3BandTable1).Forexample,thesuccess rateforBreastInvasiveCarcinoma(BRCA)is96-98%.However,awidespectrumofsurvivalrates(43,44) is associated with breast cancers, supporting the notion of heterogeneous molecular basis of this disease. Therefore, managing the disease is likely to benefit from a refined classification within the broadclassofBRCA.AprognosisprofilecombiningmRNAs,miRNAsandDNAmethylationhadbeen proposed(45).Inadditiontothegenesthatwereimplicatedasdrivermutations,theauthorsidentified number of informative miRNAs such as hsa-miR-328, hsa-miR-484, and hsa-miR-874. While these miRNAswereassociatedwithcellproliferation,angiogenesisandtumor-suppressivefunctions(46,47), they are poor separators of BRCA from other major tumor types(48). Actually, these miRNAs were implicatedinColonAdenocarcinoma(COAD)(49)andmostlyinrenalandlungcarcinomas(48). Importantly,eachcancertypedisplaysitsuniquemiRNAset.Forexample,overexpressedmiRNAsthat wereimplicatedingastriccancerincludemiR-17-5p,miR-18a,miR-18b,miR-19a,miR-20b,miR-20a, miR-21,miR-106a,miR-106b,miR-135b,miR-183,miR-340-3p,miR-421andmiR-658.Incoloncancer, inadditiontothevalidatedoncomiRs,miR-16,miR-31,miR-34a,miR-96,miR-125bandmiR-133bwere overexpressed. The list of cancer associated miRNAs for pancreatic ductal adenocarcinoma (PDAC) includes miR-15b, miR-95, miR-155, miR-186, miR-190, miR-196a, miR-200b, miR-221 and miR-222 (reviewed(50)).AmongtheunderrepresentedmiRNAs(calledanti-oncomiRs)let-7a-1,miR-143and miR-145werevalidatedforanumberofcancertypes. We found no statistical evidence for an overlap between the ‘most informative’ miRNAs (e.g., supplementalTableS3,Figure7B)andmiRNAsthatarereportedintheliteraturetogoverncancer progression.ThelaterincludesanextendedlistofoncomiRsincludingmiR-15,miR-16,miR-17,miR18,miR-19a,miR-19b,miR-20,miR-21,miR-92,miR-125b,miR-155andmiR-569.Wepostulatethat thereisnodirectcorrespondencebetweentheinformativetailoflowlyexpressingmiRNAsandthe dominatingmiRNAsthatbestassociatedwithatransitionofacelltoitstransformed,cancerousstate. 16 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. Amongthetop20mostinformativemiRNAs,miR-205andmiR-141exhibitmaximalvariabilityforthe 27classes(byavectorof81values,seeMaterialsandMethods).AdditioninformativemiRNAs(Figure 7B) include miR-7-3, miR-31, miR-135a/135b, miR-149, miR-196a-1, miR-200a/200b/200c, miR-215, miR-224, miR-328, miR-429, miR-552, miR-934, miR-944 and miR-1251. In view of the lack of any statistical enrichment between miRNAs assigned as the most significant ones (Figure 7B) and the collectionofoncomiRs/anti-oncomiRsfromtheliterature,wepostulatethatthemulticlasstypingand markersfortumorigenesisrepresentingdifferentaspectsinthecharacterizationofthedisease. Inspecting the top 50 miRNAs from our results addresses (indirectly) a question on the nature of miRNAsthatcapturemostoftheclassificationinformation.AmongthemostvariablemiRNAs,three ofthemiRNAs(hsa-miR-141,hsa-mir-200aandhsa-mir-200b)belongtoonefamily(mir-8)withahigh correlationintheirappearanceinall27classes(r2=0.71-0.74). Clinicalapplicationandpremises Ourresultscanbeusedinclinicaltreatmentofpatientshavingcancerofunknownprimaryorigin(CUP). However, toward improvements in clinical guidelines, additional measurements (e.g., immunehistochemical,PCRformRNAexpression)arelikelytobebeneficial.AsetofmiRNAsthatbestclassify cancersamplesbytissueoforiginwaspresentedinviewofthesuccessindiagnosisofCUP(20,25,26). A significant overlap between the most informative miRNAs (extracted from (26)) and our SVM protocolwasfound.Bycomparinginformative50candidatesfromthetwostudies,wedetected10 shared miRNAs (miR-122, miR-138-1, miR-141, miR-146a, miR-196a, miR-200a, miR-200c, miR-205, miR-31andmiR-9-3).Despitealargedifferenceintheanalyzeddata(500versus8000samples)and themethodologyused(decisiontreevsSVM),theconsistencyamonginformativemiRNAsisintriguing (p-value2E^5).Ampleevidencedemonstratedtheimpactofalterationinhsa-miR-200/hsa-miR-141 expressiononproliferation,morphologyandaberranthistoneacetylationinnumerouscancersandcell types.AnoverviewonthecontributionofmiRNAsforcancerdiagnosis,theadvantagesandlimitations ofseveralcomplementarymethodswasrecentlyreported(51). 17 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. Lastly,weinspectedthefeaturesthatspecifytheleastinformativemiRNAsinourstudy.Thereare90 miRNAs (out of 1046) that are completely non informative to our task. Namely, in the trichotomy transformationtheyareuniformlyexpressedwithrespecttothe3*27classes.ForthesemiRNAsthe sumofthestandarddeviations(TableS3)foreachofthetrichotomylabeliszero.Obviously,these miRNAs cannot contribute to the classification task. Interestingly, several validated omcomiRs are among these miRNAs (e.g., miR-16, miR-17, miR-18 and miR-21). This emphasizes the uncoupling between the informative low expressing miRNAs we have identified for the classification task and highlyexpressedoncomiRs.ThelevelofexpressionofoncomiRsareusedasvaluablemarkersforcell dysregulationandassucharethehallmarkofproliferationandtransformationincancer. Funding SupportedbyERCgrant339096onHighDimensionalCombinatoricsandELIXIR-Excelerategrant. References 1. Iorio,M.V.andCroce,C.M.(2012)MicroRNAdysregulationincancer:diagnostics,monitoring andtherapeutics.Acomprehensivereview.EMBOmolecularmedicine,4,143-159. 2. Baffa,R.,Fassan,M.,Volinia,S.,O'Hara,B.,Liu,C.G.,Palazzo,J.P.,Gardiman,M.,Rugge,M., Gomella,L.G.,Croce,C.M.etal.(2009)MicroRNAexpressionprofilingofhumanmetastatic cancersidentifiescancergenetargets.TheJournalofpathology,219,214-221. 3. Dumortier,O.,Hinault,C.andVanObberghen,E.(2013)MicroRNAsandmetabolism crosstalkinenergyhomeostasis.Cellmetabolism,18,312-324. 4. Calin,G.A.andCroce,C.M.(2006)MicroRNAsignaturesinhumancancers.Naturereviews. Cancer,6,857-866. 5. Volinia,S.,Calin,G.A.,Liu,C.G.,Ambs,S.,Cimmino,A.,Petrocca,F.,Visone,R.,Iorio,M., Roldo,C.,Ferracin,M.etal.(2006)AmicroRNAexpressionsignatureofhumansolidtumors definescancergenetargets.ProceedingsoftheNationalAcademyofSciencesoftheUnited StatesofAmerica,103,2257-2261. 6. Wang,D.,Gu,J.,Wang,T.andDing,Z.(2014)OncomiRDB:adatabasefortheexperimentally verifiedoncogenicandtumor-suppressivemicroRNAs.Bioinformatics,30,2237-2238. 7. Jansson,M.D.andLund,A.H.(2012)MicroRNAandcancer.Molecularoncology,6,590-610. 8. Lujambio,A.andLowe,S.W.(2012)Themicrocosmosofcancer.Nature,482,347-355. 18 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. 9. Kitade,Y.andAkao,Y.(2010)MicroRNAsandtheirtherapeuticpotentialforhumandiseases: microRNAs,miR-143and-145,functionasanti-oncomirsandtheapplicationofchemically modifiedmiR-143asananti-cancerdrug.Journalofpharmacologicalsciences,114,276-280. 10. Hayes,J.,Peruzzi,P.P.andLawler,S.(2014)MicroRNAsincancer:biomarkers,functionsand therapy.Trendsinmolecularmedicine,20,460-469. 11. Lu,J.,Getz,G.,Miska,E.A.,Alvarez-Saavedra,E.,Lamb,J.,Peck,D.,Sweet-Cordero,A.,Ebert, B.L.,Mak,R.H.,Ferrando,A.A.etal.(2005)MicroRNAexpressionprofilesclassifyhuman cancers.Nature,435,834-838. 12. DiLeva,G.andCroce,C.M.(2013)miRNAprofilingofcancer.Currentopinioningenetics& development,23,3-11. 13. Nishida,N.,Yamashita,S.,Mimori,K.,Sudo,T.,Tanaka,F.,Shibata,K.,Yamamoto,H.,Ishii, H.,Doki,Y.andMori,M.(2012)MicroRNA-10bisaprognosticindicatorincolorectalcancer andconfersresistancetothechemotherapeuticagent5-fluorouracilincolorectalcancer cells.Annalsofsurgicaloncology,19,3065-3071. 14. Bockhorn,J.,Dalton,R.,Nwachukwu,C.,Huang,S.,Prat,A.,Yee,K.,Chang,Y.F.,Huo,D., Wen,Y.,Swanson,K.E.etal.(2013)MicroRNA-30cinhibitshumanbreasttumour chemotherapyresistancebyregulatingTWF1andIL-11.Naturecommunications,4,1393. 15. Bryant,R.J.,Pawlowski,T.,Catto,J.W.,Marsden,G.,Vessella,R.L.,Rhees,B.,Kuslich,C., Visakorpi,T.andHamdy,F.C.(2012)ChangesincirculatingmicroRNAlevelsassociatedwith prostatecancer.Britishjournalofcancer,106,768-774. 16. Pentheroudakis,G.,Briasoulis,E.andPavlidis,N.(2007)Cancerofunknownprimarysite: missingprimaryormissingbiology?Theoncologist,12,418-425. 17. Pavlidis,N.andPentheroudakis,G.(2010)Cancerofunknownprimarysite:20questionsto beanswered.Annalsofoncology:officialjournaloftheEuropeanSocietyforMedical Oncology/ESMO,21Suppl7,vii303-307. 18. Fizazi,K.,Greco,F.A.,Pavlidis,N.,Daugaard,G.,Oien,K.,Pentheroudakis,G.andCommittee, E.G.(2015)Cancersofunknownprimarysite:ESMOClinicalPracticeGuidelinesfordiagnosis, treatmentandfollow-up.Annalsofoncology:officialjournaloftheEuropeanSocietyfor MedicalOncology/ESMO,26Suppl5,v133-138. 19. Greco,F.A.,Oien,K.,Erlander,M.,Osborne,R.,Varadhachary,G.,Bridgewater,J.,Cohen,D. andWasan,H.(2012)Cancerofunknownprimary:progressinthesearchforimprovedand rapiddiagnosisleadingtowardsuperiorpatientoutcomes.Annalsofoncology:official journaloftheEuropeanSocietyforMedicalOncology/ESMO,23,298-304. 20. Rosenwald,S.,Gilad,S.,Benjamin,S.,Lebanony,D.,Dromi,N.,Faerman,A.,Benjamin,H., Tamir,R.,Ezagouri,M.,Goren,E.etal.(2010)ValidationofamicroRNA-basedqRT-PCRtest foraccurateidentificationoftumortissueorigin.Modernpathology:anofficialjournalofthe UnitedStatesandCanadianAcademyofPathology,Inc,23,814-823. 21. Varadhachary,G.R.,Spector,Y.,Abbruzzese,J.L.,Rosenwald,S.,Wang,H.,Aharonov,R., Carlson,H.R.,Cohen,D.,Karanth,S.,Macinskas,J.etal.(2011)Prospectivegenesignature studyusingmicroRNAtoidentifythetissueoforigininpatientswithcarcinomaofunknown primary.Clinicalcancerresearch:anofficialjournaloftheAmericanAssociationforCancer Research,17,4063-4070. 22. Bishop,J.A.,Benjamin,H.,Cholakh,H.,Chajut,A.,Clark,D.P.andWestra,W.H.(2010) Accurateclassificationofnon-smallcelllungcarcinomausinganovelmicroRNA-based approach.Clinicalcancerresearch:anofficialjournaloftheAmericanAssociationforCancer Research,16,610-619. 19 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. 23. Gilad,S.,Lithwick-Yanai,G.,Barshack,I.,Benjamin,S.,Krivitsky,I.,Edmonston,T.B.,Bibbo, M.,Thurm,C.,Horowitz,L.,Huang,Y.etal.(2012)Classificationofthefourmaintypesof lungcancerusingamicroRNA-baseddiagnosticassay.TheJournalofmoleculardiagnostics: JMD,14,510-517. 24. Ferracin,M.,Pedriali,M.,Veronese,A.,Zagatti,B.,Gafa,R.,Magri,E.,Lunardi,M.,Munerato, G.,Querzoli,G.,Maestri,I.etal.(2011)MicroRNAprofilingfortheidentificationofcancers withunknownprimarytissue-of-origin.TheJournalofpathology,225,43-53. 25. Meiri,E.,Mueller,W.C.,Rosenwald,S.,Zepeniuk,M.,Klinke,E.,Edmonston,T.B.,Werner, M.,Lass,U.,Barshack,I.,Feinmesser,M.etal.(2012)Asecond-generationmicroRNA-based assayfordiagnosingtumortissueorigin.Theoncologist,17,801-812. 26. Rosenfeld,N.,Aharonov,R.,Meiri,E.,Rosenwald,S.,Spector,Y.,Zepeniuk,M.,Benjamin,H., Shabes,N.,Tabak,S.,Levy,A.etal.(2008)MicroRNAsaccuratelyidentifycancertissueorigin. Naturebiotechnology,26,462-469. 27. DeCecco,L.,Dugo,M.,Canevari,S.,Daidone,M.G.andCallari,M.(2013)Measuring microRNAexpressionlevelsinoncology:fromsamplestodataanalysis.Criticalreviewsin oncogenesis,18,273-287. 28. Kosaka,N.,Iguchi,H.andOchiya,T.(2010)CirculatingmicroRNAinbodyfluid:anew potentialbiomarkerforcancerdiagnosisandprognosis.Cancerscience,101,2087-2092. 29. Fesler,A.,Jiang,J.,Zhai,H.andJu,J.(2014)CirculatingmicroRNAtestingfortheearly diagnosisandfollow-upofcolorectalcancerpatients.Moleculardiagnosis&therapy,18, 303-308. 30. Monzon,F.A.andKoen,T.J.(2010)Diagnosisofmetastaticneoplasms:molecularapproaches foridentificationoftissueoforigin.Archivesofpathology&laboratorymedicine,134,216224. 31. Gore,J.,Craven,K.E.,Wilson,J.L.,Cote,G.A.,Cheng,M.,Nguyen,H.V.,Cramer,H.M., Sherman,S.andKorc,M.(2015)TCGAdataandpatient-derivedorthotopicxenografts highlightpancreaticcancer-associatedangiogenesis.Oncotarget,6,7504-7521. 32. Zhang,W.(2014)TCGAdividesgastriccancerintofourmolecularsubtypes:implicationsfor individualizedtherapeutics.Chinesejournalofcancer,33,469-470. 33. Ho,D.W.,Kai,A.K.andNg,I.O.(2015)TCGAwhole-transcriptomesequencingdatareveals significantlydysregulatedgenesandsignalingpathwaysinhepatocellularcarcinoma. Frontiersofmedicine,9,322-330. 34. Zhu,Y.,Qiu,P.andJi,Y.(2014)TCGA-assembler:open-sourcesoftwareforretrievingand processingTCGAdata.Naturemethods,11,599-600. 35. Chang,C.C.,Hsu,C.W.andLin,C.J.(2000)Theanalysisofdecompositionmethodsfor supportvectormachines.IEEEtransactionsonneuralnetworks/apublicationoftheIEEE NeuralNetworksCouncil,11,1003-1008. 36. Abraham,A.,Pedregosa,F.,Eickenberg,M.,Gervais,P.,Mueller,A.,Kossaifi,J.,Gramfort,A., Thirion,B.andVaroquaux,G.(2014)Machinelearningforneuroimagingwithscikit-learn. Frontiersinneuroinformatics,8,14. 37. Bosson,A.D.,Zamudio,J.R.andSharp,P.A.(2014)EndogenousmiRNAandtarget concentrationsdeterminesusceptibilitytopotentialceRNAcompetition.Molecularcell,56, 347-359. 20 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. 38. Jackson,A.L.,Burchard,J.,Schelter,J.,Chau,B.N.,Cleary,M.,Lim,L.andLinsley,P.S.(2006) WidespreadsiRNA"off-target"transcriptsilencingmediatedbyseedregionsequence complementarity.Rna,12,1179-1187. 39. Tay,Y.,Rinn,J.andPandolfi,P.P.(2014)ThemultilayeredcomplexityofceRNAcrosstalkand competition.Nature,505,344-352. 40. Hansen,T.B.,Jensen,T.I.,Clausen,B.H.,Bramsen,J.B.,Finsen,B.,Damgaard,C.K.andKjems, J.(2013)NaturalRNAcirclesfunctionasefficientmicroRNAsponges.Nature,495,384-388. 41. Balaga,O.,Friedman,Y.andLinial,M.(2012)TowardacombinatorialnatureofmicroRNA regulationinhumancells.Nucleicacidsresearch,40,9404-9416. 42. Metpally,R.P.,Nasser,S.,Malenica,I.,Courtright,A.,Carlson,E.,Ghaffari,L.,Villa,S.,Tembe, W.andVanKeuren-Jensen,K.(2013)ComparisonofAnalysisToolsformiRNAHigh ThroughputSequencingUsingNerveCrushasaModel.Frontiersingenetics,4,20. 43. CancerGenomeAtlas,N.(2012)Comprehensivemolecularportraitsofhumanbreast tumours.Nature,490,61-70. 44. Tahiri,A.,Leivonen,S.K.,Luders,T.,Steinfeld,I.,RagleAure,M.,Geisler,J.,Makela,R.,Nord, S.,Riis,M.L.,Yakhini,Z.etal.(2014)Deregulationofcancer-relatedmiRNAsisacommon eventinbothbenignandmalignanthumanbreasttumors.Carcinogenesis,35,76-85. 45. Volinia,S.andCroce,C.M.(2013)PrognosticmicroRNA/mRNAsignaturefromtheintegrated analysisofpatientswithinvasivebreastcancer.ProceedingsoftheNationalAcademyof SciencesoftheUnitedStatesofAmerica,110,7413-7417. 46. Dai,Y.,Sui,W.,Lan,H.,Yan,Q.,Huang,H.andHuang,Y.(2009)Comprehensiveanalysisof microRNAexpressionpatternsinrenalbiopsiesoflupusnephritispatients.Rheumatology international,29,749-754. 47. Nohata,N.,Hanazawa,T.,Kikkawa,N.,Sakurai,D.,Fujimura,L.,Chiyomaru,T.,Kawakami,K., Yoshino,H.,Enokida,H.,Nakagawa,M.etal.(2011)TumoursuppressivemicroRNA-874 regulatesnovelcancernetworksinmaxillarysinussquamouscellcarcinoma.Britishjournal ofcancer,105,833-841. 48. Leidinger,P.,Keller,A.,Borries,A.,Huwer,H.,Rohling,M.,Huebers,J.,Lenhof,H.P.and Meese,E.(2011)SpecificperipheralmiRNAprofilesfordistinguishinglungcancerfrom COPD.Lungcancer,74,41-47. 49. Li,H.G.,Zhao,L.H.,Bao,X.B.,Sun,P.C.andZhai,B.P.(2014)Meta-analysisofthe differentiallyexpressedcolorectalcancer-relatedmicroRNAexpressionprofiles.European reviewformedicalandpharmacologicalsciences,18,2048-2057. 50. Paranjape,T.,Slack,F.J.andWeidhaas,J.B.(2009)MicroRNAs:toolsforcancerdiagnostics. Gut,58,1546-1554. 51. Pichler,M.andCalin,G.A.(2015)MicroRNAsincancer:fromdevelopmentalgenesinworms totheirclinicalapplicationinpatients.Britishjournalofcancer,113,569-573. 21 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. Tables Table 1. Average classification success for classification of the 27 classes by various data transformationprotocols. Class #ofsamples Rawdata Dichotomy Trichotomy Pluralityvote ACC 80 0.7 0.86 0.96 0.96 BLCA 275 0.3 0.57 0.8 0.77 BRCA 1066 0.86 0.96 0.97 0.98 Breast 102 0.7 0.23 0.91 0.92 CESC 258 0.23 0.33 0.75 0.78 COAD* 433 0.82 0.9 0.8 0.97 ESCA 127 0.03 0.02 0.43 0.52 HNSC 513 0.8 0.89 0.89 0.92 KICH 66 0.74 0.53 0.92 0.94 Kidney 71 0.19 0.55 0.99 1 KIRC 535 0.87 0.96 0.95 0.95 KIRP 258 0.77 0.74 0.91 0.91 LGG 518 0.99 1 1 0.99 LIHC 269 0.8 0.95 0.98 0.97 LUAD 499 0.78 0.83 0.9 0.94 LUSC 467 0.6 0.69 0.84 0.87 PAAD 96 0.12 0.46 0.91 0.89 PCPG 184 0.93 0.99 0.99 1 PRAD 421 0.97 0.99 0.99 1 READ* 159 0.03 0.02 0.39 - SARC 196 0.32 0.8 0.97 0.99 SKCM 411 0.92 0.99 0.99 0.99 STAD 353 0.54 0.59 0.87 0.87 THCA 507 0.94 1 0.99 1 Thyroid 59 0.08 0 0.82 0.88 UCEC 542 0.7 0.93 0.94 0.95 UCS 57 0.08 0.15 0.79 0.8 AVERAGE 315.63 0.59 0.66 0.88 0.91 *COADandREADclassesweremergedforthepluralityvoteprotocol 22 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. FigureLegends Figure1.Distancematrixoftherelationsbetweenlungadenocarcinomapatients’samplesandhealthy lung samples. There are 50 healthy samples and 521 diseased samples. The distance is the cosine between each pair of samples. The larger the angle, the larger is the distance between the paired miRNAexpressionvectors.Bluetoredindicatingtherangefromsimilaritytomaximaldistance. Figure2.Analysisoftherelationsbetweensamplesfromdifferenttissueorigin.(A)Distancematrixof healthysamplesfrom5sources.Thehealthysamplesarefarmoresimilarwithinthetissuethanamong tissues,substantialsimilarityisobservedbetweenhealthysamplesfromlungadenocarcinoma(LUAD) andlungsquamouscellcarcinoma(LUSC)thataremarkedaslungandlung2,respectively.(B)Matched diseasedsamplesfromthesametissuesreportedin(A).Thenumberofpatientsfromeachclassis indicated (axis Y). Recall that for the machine learning approach we excluded classes that are supported by <50 samples (see Materials and Methods). As oppose to the healthy samples, the distancesbetweenthepatientsamplesareratheruniformwithnoclearpartitionbetweendiseased classes.Bluetoredindicatingtherangefromsimilaritytomaximaldistance.(C)A'birdview'ofthe miRNA-basedinformationfor>8500TCGAsamples.All27classesarelistedinanalphabeticalorder. The classes that are associated with healthy samples (Breast, Kidney and Thyroid) are marked by arrows.Notethatthecorrelationwithinthesameclass(thediagonal)isnotnecessarilymaximal(e.g., 0.61forSarcoma,SARC).Thebrainlowergradeglioma(LGG)isanoutlierwithmaximaldistancetoany otherclass. Figure3.Classificationresultsfollowingdatatransformation.(A)Aschemeoftheworkflow.Eachknob describedeitheradatatransformationorarefinementoftheusedmethod.(B)Averagesuccessrates foreachclasswhileusingrawdata(blue),datainadichotomousrepresentation(emptydiamonds) and data following a trichotomy representation (red). (C) Fold improvement achieved by applying trichotomy transformation with respect to the results from raw data. The classes are sorted alphabetically. 23 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. Figure4.Averagesuccessratewhenapplying200separationthresholdsfrom5to1000RPMwith5 RPM increments. The average success rate obtained for the dichotomous transformation and the trichotomoustransformationsareonTopandBottom,respectively.Thedropinperformancesupports the notion of increasing noisy data by increasing the threshold. Note a difference in the average performancefordichotomyvs.trichotomytransformations. Figure5.Acircularviewincluding27classesaccordingtotheerrortypesandthesizeofeachclass. Each class is associated with two quantifiers and their summation. The inner arc displays the distributionoftheclassesclassifiedasthesubjectedclass(i.e.,true-positiveandfalse-positive),the middlearcdisplaysthedistributionofclassificationofthesampleswhichbelongtothesubjectedclass (i.e.,true-positiveandfalse-negative).Theouterarcisthesummationoftheinnerandmiddlearcs. Thenumberofedgesandtheircolor-codeddisplaytheextentandnatureofallmissedclassifications. Notethetwo-sidederrorsfortheclassesofCOADandREAD.Eachclassiscolorcodedandtheerrors arecoloredbytheoriginofthemisclassifiedclasses.SupplementarydataTableS2depictstheaverage classificationresultsperclassfor100SVMruns.Table2isavisualizationofthedatapresentsinTable S2. Figure6.Analysisofclassificationerrorsfromsimilarsources.(A)Adiagramoftheaffiliatedtissues. Healthy and cancerous tissues derived from the same tissues are connected with a green arrow. Adjacentcanceroustissuesareconnectedwitharedarrowtoreflecttheanatomicalrelatedness.(B) The rate of intra-tissue and cancerous state tissue out of the total errors for each of the relevant classes.(C)Anexampleforthesourceofmisclassificationfocusingonthefalsepositives.Thefraction offalsepositivesoccupies12.7%.Themainclassesthatcontributetothefalsepositiveareshownalong theircontribution.Theindicated6classesshownaccountfor85%ofallfalsepositivesforthisclass (BLCA). Figure7.AveragesuccessrateachievedwhenlimitingthenumberofmiRNAs.(A)Theperformanceof theclassificationschemefollowingreductionofthemiRNAsamplesizeto50,100and200isshown. 24 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. ThebluemarkersshowtheresultswithrandomlyselectedmiRNAsintheirrawformareused.Thered markerswererunwithrandomlyselectedmiRNAinthetrichotomyformat.Theorangemarkerswere runwiththetopinformativemiRNAinthetrichotomyform(seeMaterialsandMethods).Thebox-plot presents75%ofthedata.Thes.d.foreachcolumnfortherawdatawere6.07%,7.59%and7.19%(for 50,100and200miRNAsaccordingly).Thes.d.foreachcolumnfortherandommiRNAinthetrichotomy representationwere5.09%,4.15%and4.8%.Thes.d.foreachoftheinformativetrichotomyrunswere 0.01%.(B)Alistof20mostinformativemiRNAsaccordingtothes.d.ofthethreevectorsusedtodefine the trichotomy transformation. The miRNAs are sorted according to the s.d. for the non-expressed vector.ThefulllistofmiRNAsassociatedwiththes.d.foreachofthethreecoordinatesandrankedby thevariabilitysumisavailableinSupplementalTableS3. 25 bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license. Supplementary data Table S1. The table include two types of information: (i) List of all miRNAs includedintheTCGAdata;(ii)TCGAIDforeachofthesamplesthatwereincludedintheanalysis. Supplementary data Table S2. The table depicts the average classification results per class for 100 SVMruns.NumericsourcedataforFigure5. SupplementarydataTableS3.AlistofinformativemiRNAandtheirrespectiveranking.Datasource forFigure7B. 26 A B BRCA LIHC LUAD LUSC UCS C A A RawData RawData B Dichotomous Representation Dichotomous Representation Thrichotomous Representation Trichotomous Representation Mergingclasses (erroranalysis) Unification of READandSTAD RecallSuccess Rate 1 0.8 0.6 0.4 0.2 0 C rawdata Improvementratio trichotomy dichotomy (trichotomy/rawdata) C Ratio(Trichotomy/rawdata) 18 16 14 12 10 8 6 4 2 0 Pluralityvote MajorityVote averagesuccessrate Dichotomy 0.81 0.8 0.79 0.78 0.77 0.76 0.75 0.74 0.73 0.72 0.71 0 200 400 600 800 1000 800 1000 miRNAthreshold (RPM) Trichotomy averagesuccessrate 0.89 0.885 0.88 0.875 0.87 0.865 0.86 0.855 0.85 0.845 0.84 0 200 400 600 miRNAthreshold (RPM) A Thyroid THCA Breast BRCA LUAD LUSC UCS UCEC READ COAD STAD Kidney B KIRC KICH KIRP Cancerousstateerrors Intra-tissueerrors 100 90 80 70 60 50 40 30 20 10 0 C Miscellaneouserrors Mapping FalsePositives(>1%) BladderUrothelialCarcinoma(BLCA) HeadandNeckSquamous CellCarcinoma(HNSC) 87.3 1.1 1.1 Stomach Adenocarcinoma (STAD) 3.6 2.4 1.1 LungSquamous CellCarcinoma(LUSC) Esophageal Carcinoma(ESCA) Kidney RenalPapillary CellCarcinoma(KIRP) 1.5 CervicalSquamous CellCarcinoma& Endocervical Adenocarcinoma (CESC) A B 0.5 0.45 0.4 StandardDeviation(s.d.) 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 s.d.[trichotomy, non-expressed] s.d.[trichotomy,lowlyexpressed] s.d.[trichotomy, highlyexpressed]]