Download PDF

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

The Cancer Genome Atlas wikipedia , lookup

Transcript
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
Lowest expressing microRNAs capture indispensable information:
identifyingcancertypes
RoniRasnic1,NathanLinial1*andMichalLinial2*
1
SchoolofComputerScienceandEngineering,TheHebrewUniversityofJerusalem,Israel
2
Deptartment of Biological Chemistry, Institute of Life Sciences, The Hebrew University of
Jerusalem,Israel
*
Correspondingauthors
MichalLinial
DeptofBiologicalChemistry(ML),TheHebrewUniversity,EdmondJ.Safra,Jerusalem,91904
[email protected]: 972-2-6584884FAX: 972-2-6586448
Figures:7/Table:1/SupplementalTables:3
Runningtitle:LowestexpressingmicroRNAsincancertyping
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
ABSTRACT
The primary function of microRNAs (miRNAs) is to maintain cell homeostasis. In cancerous tissues
miRNAs’expressionundergodrasticalterations.Inthisstudy,weusedmiRNAexpressionprofilesfrom
The Cancer Genome Atlas (TCGA) of 24 cancer types and 3 healthy tissues, collected from >8500
samples.Weseektoclassifythecancer’soriginandtissueidentificationusingtheexpressionfrom1046
reportedmiRNAs.DespiteanapparentuniformappearanceofmiRNAsamongcanceroussamples,we
recover indispensable information from lowly expressed miRNAs regarding the cancer/tissue types.
Multiclass support vector machine classification yields an average recall of 58% in identifying the
correct tissue and tumor types. Data discretization has led to substantial improvement reaching an
average recall of 91% (95% median). We propose a straightforward protocol as a crucial step in
classifying tumors of unknown primary origin. Our counter-intuitive conclusion is that in almost all
cancer types, highly expressing miRNAs mask the significant signal that lower expressed miRNAs
provide.
Keywords:
TCGA,non-codingRNA,Cancerofunknownprimaryorigin,Machinelearning,Cancergenomics
2
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
INTRODUCTION
MaturemicroRNAs(miRNAs)areshort,non-codingRNAmolecules.Theyleadtoposttranscriptional
repression by reducing the stability of mRNAs and attenuating the translation machinery (1). In
humans,thereareabout1900miRNAgenesthatyield~2600maturemiRNAs.Theexpressionlevelsof
miRNAsinhealthytissuesspan5to6ordersofmagnitude(2).
In all multicellular organisms including humans, miRNAs were implicated in embryogenesis, tissue
identityanddevelopment.Asthegatekeepersofcellhomeostasis(3),miRNAsrespondtoalterationin
cell regulation that occurs during viral infection, inflammation, and numerous pathologies. In
mammals, cell transformation and carcinogenesis are accompanied by drastic alterations in miRNA
expression profiles (4,5). Those miRNAs that are associated with carcinogenesis and metastasis are
called oncomiRs (6). They were implicated in targeting components of the cell cycle, DNA repair,
oncogenesortumorsuppressorgenes(7,8).MostoncomiRsareexpressedinmultiplecancertissues.
However,somearespecifictoonlycertaincancertissues.Forexample,humanmiR-21over-expression
isassociatedwithalmostallcancerswhilemiR-15andmiR-16expressionsaremostlyassociatedwith
B-cellneoplasm.OthermiRNAs(miR-143andmiR-145)directlyregulateotheroncomiRsandthusare
candidatesforanti-tumortherapy(9).Therefore,alteredmiRNAexpressionincancerousversushealthy
tissuesissuggestedasinvaluablebiomarkers,forcancerdiagnosisandprognosisandasaleadfornovel
therapeuticapproach(10).DespiteadvancesinunderstandingcellderegulationbymiRNAsincancers
formostmiRNAs,arelationbetweenexpressionlevel,diagnosisandprognosisforspecificcancertypes
cannotbedrawn(seediscussionin(11,12)).
From a clinical perspective, profiling miRNAs is important for: (i) Selecting optimal treatment; (ii)
Monitoringthedisease’sprogression;(iii)Identifyingtheprimaryoriginofametastaticcancer(12).For
example, miR-10b level was shown to be a prognosis indicator for chemotherapeutic resistance in
colorectalcancer(13)whilemiR-30cinhibitstumorchemotherapyresistanceinbreastcancer(14).The
3
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
monitoring of minute miRNAs levels in patient’s body fluids allows a follow up for treatment and a
directassessmentfordisease’sprogression(5,15).
Manyofthesamplescollectedfromcancerpatientsaremetastatic.Evidently,itismorechallengingto
treatmetastaticthanprimarycancers(seereviews(16,17)).Importantly,theoriginofupto12-15%of
allnewlydiagnosedcancersisunknownoruncertain(18).Duetothedifficultyofidentifyingtheorigin
of metastatic tissues, cancers of unknown primary origin (coined CUP) are associated with poor
prognosisandinconclusiveprotocolfordiseasemanagement(19).Consequently,thereisanactiveline
ofresearchseekingwaystoidentifythetypeandoriginoftissuesfrommetastaticbiopsiesthrough
their molecular signatures (for example (12,20,21)). Success in classifying lung squamous cell
carcinomasfromlungadenocarcinoma(22)wasfurtherexpandedtofourmainsubtypesoflungcancer
usingmiRNAprofiles(23).Inanotherstudy,aclassifierbasedon47miRNAssuccessfullyseparated10
differenttissuesfrom~100samplesfromprimaryandmetastaticsamples(24).Expandingthetaskto
42cancertypesresultedinanextendedsetof68miRNAs(25)thattogetherareusedasindicatorsfor
tissue of origin in numerous biopsies from cancer patients (26). Although most measurements of
miRNAsoriginatefromcancerbiopsies(27),miRNAscirculatinginbodyfluidshasbecomeattractive
candidatesforearlydiagnosis(28,29).
ThegoalofourstudyistousethemiRNAexpressionprofiledatafromTheCancerGenomeAtlas(TCGA)
toward the task of classifying different cancerous tissues and their disease/healthy states. TCGA
providesrichmoleculardatafromcancerpatientsonanunprecedentedscale(30),withsamplesof>25
cancertypes.WefocusedonmiRNAprofilesthatreportonthenormalizedexpressionlevelofeachof
the1046analyzedmiRNAs.Specifically,weask:(i)WhichmiRNAsbestcapturetheinformationneeded
to distinguish the various cancerous types and origin? (ii) Which tissues are prone to false
classifications?(iii)CanwefindabiologicalinterpretationofthesetofinformativemiRNAs?
MATERIALSANDMETHODS
4
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
Datacollection
MicroRNAs(miRNAs)profileswereextractedfromTheCancerGenomeAtlas(TCGA,collectedbetween
October2013andJanuary2014)(31-34).Wedefinedaclassbycouplingtheoriginatingtissuewiththe
labelofthesample’sstate(i.e.,cancerousorhealthy).Welimitedouranalysistoclasseswithover50
samples.Intotalwecollected27classes.ThreeclassesofhealthytissuesareabbreviatedasKidney,
ThyroidandBreast.Additionally,24cancerousclassesandtheiracronymsarelisted:adrenocortical
carcinoma(ACC),bladderurothelialcarcinoma(BLCA),brainlowergradeglioma(LGG),breastinvasive
carcinoma(BRCA),cervicalsquamouscellcarcinomaandendocervicaladenocarcinoma(CESC),colon
adenocarcinoma (COAD), esophageal carcinoma (ESCA), head and neck squamous cell carcinoma
(HNSC),kidneychromophobe(KICH),kidneyrenalclearcellcarcinoma(KIRC),kidneyrenalpapillary
cell carcinoma (KIRP), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung
squamous cell carcinoma (LUSC), pancreatic adenocarcinoma (PAAD), pheochromocytoma and
paraganglioma (PCPG), prostate adenocarcinoma (PRAD), rectum adenocarcinoma (READ), sarcoma
(SARC), skin cutaneous melanoma (SKCM), stomach adenocarcinoma (STAD), thyroid carcinoma
(THCA),uterinecarcinosarcoma(UCS)anduterinecorpusendometrialcarcinoma(UCEC).
The27classesinclude8522distinctsamples.Eachsampleisassociatedwiththeexpressionof1046
miRNAs. We assessed the similarity among patient samples by calculating the cosine of the angle
betweenthevectorsofmiRNAexpression.Thesecosineswereusedasaproxytotheinitialseparation
abilityandsimilarityamongclasses.ThelistsofallsamplesandmiRNAs’TCGAidentifiersareprovided
insupplementarydataTableS1.
Datatransformation
Weapplyaroughdiscretizationtothedata,accordingtomiRNAexpressionlevel,measuredinreads
permillion(RPM).Westartwithadichotomy-onecategoryforexpressionlevelsstrictlybetween0
and30RPM,andasecondcategoryforeverythingelse.Asomewhatmorerefinedclassificationisa
trichotomy,thatpartitionsthedatatonon-expressed,lowlyexpressed(expressionlevelsfrom0to30
5
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
RPM)andhighlyexpressed(>30RPM).Onaverage,eachsamplehas306lowlyexpressedmiRNAs(s.d.
of43.91),and163highlyexpressedmiRNAs(s.d.of18.58).
Machinelearningmodel
HumanTCGAsampleswereclassifiedtothe27classes(seeabove)bytrainingamulticlasssupport
vector machine (SVM) classifier (35). We used a Python implementation based on the scikit-learn
package(36).Thetrainingsetsfortherawdata,thedichotomyandthetrichotomywerelimitedto40%
ofthesamplesofeachclass.ThesensitivityoftheSVMclassifiertothesizeofthetrainingsetwas
tested.Thetestswererepeatedwithtrainingsetcomprisedof20%to80%ofeachclassofsamples,
with increments of 10%. Our results for both data transformations were consistent and stable for
trainingsetsofatleast40%ofthesamplesandsomewhatlessstablefortrainingsetsbelowthe40%
threshold. Both dichotomy and trichotomy schemes were run 1000-10,000 times each. Scikit-learn
cross-validationmechanismwasusedinordertopreventover-fitting.
Priortodataanalysiswerandomlypartitionalldatainto40%thatareunseenand60%thatareseen.
Wecreate5trainingsetseachofwhichiscomprisedofarandomsubsetoftwo-thirdsofseenpart.For
eachset,twoclassificationmatriceswerecreated,onebasedonthe30RPMthreshold,andtheother
basedona60RPMthreshold,forseparatingthelowly-expressedfromthehighly-expressed.Wethen
runourmachine-learningprocedureoneachofthese5setsandtestthemontheunseenpartofthe
data.Ourfinalclassificationischosenbycarryingapluralityvoteamongthese10classificationresults.
Performanceevaluationanderroranalysis
Weassessedourmodelsaccordingtotheiraveragerecallrateforeachoftheclasses.Recallisdefined
as[truepositive]/[truepositive+falsenegative].Theaverage,medianandstandarddeviation(s.d.)
weremeasuredforover300independentrunsoftheSVMprotocolonrawordiscretizeddata.
Wecarriedouttwotypesoferroranalysis:Atthesamplelevelandattheclasslevel.First,weanalyzed
the errors made per each of the samples over distinct runs (Sample-based Error). A second error
6
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
analysisconcernsaclassmisclassificationsandtheidentityofthefaultyassignment(Class-basedError).
The plurality vote was implemented in our final classification in order to overcome Sample-based
Errors.WehavemergedtheclassesREADandCOADinordertoreducetheextentof Class-based
Errors.
InformativemiRNAextraction
Several methodologies were applied to identify and select a minimal set of miRNAs that are most
informativetowardsasuccessfulclassificationtask.Ourbestresultswereobtainedusingtheabove
described trichotomy. Specifically, for each miRNA we create an 81-dimensional vector, with 3
coordinatesforeachofthe27classes(24cancertypesand3healthytissues).Thesethreecoordinates
aredefinedasfollows:IfthereareNsamplesinthisclassforthemiRNAathand,ofwhichN_0arenot
expressed, N_1 are between 0 and 30 RPM and N_2 are highly expressed, then we set the first
coordinateatN_0/N,thesecondtoN_1/NandthethirdtoN_2/N.Wethencalculatethestandard
deviation(s.d.)ofeachoftheexpressionlevelsamongthe27classesseparately,andcalculatethesum
ofthes.d.Formally:
𝑠𝑑(𝑁_0' , 𝑁_0) , … , 𝑁_0)+ ) + 𝑠𝑑(𝑁_1' , 𝑁_1) , … , 𝑁_1)+ ) + 𝑠𝑑(𝑁_2' , 𝑁_2) , … , 𝑁_2)+ ).
WeexpectmiRNAsofhighersummeds.d.valuestobemoreinformative,sincethes.d.capturesthe
miRNA’svariabilityamongdifferentclasses.Wechosenottoassigndistinctweightstodifferentclasses
despitetheirdifferingsamples’size(rangingfrom57forUterineCarcinosarcomato1066forBreast
InvasiveCarcinoma).
RESULTS
ManycanceroustissueshavesimilarmiRNAprofiles
ThemaingoalofourstudyistoclassifycancerclassesusingtheinformationencodedbymiRNAprofiles
frompatients.Formostcancertypes,asmallnumberofhighlyexpressedmiRNAsundergoadrastic
7
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
changeinexpressionlevelalongtheprogressionofthedisease.Therefore,itisacommonlyheldview
thatuniquelyexpressedmiRNAsmightbeusedfordiagnosis,prognosisandpossiblyalsofordisease
treatment.HerewerevisitthisviewandtestthemiRNAheterogeneityandexpressionconsistencyin
an unbiased way. We analyze many different cancer types and tissues, and thousands of patient
samples.
WewantedtogetasenseofthesimilaritybetweenthevectorsofmiRNAreadsamongsamplesfrom
eachgiventissue.WefoundhighcorrelationsamongthemiRNAexpressionvectorswhenbothsamples
are healthy or when both are diseased (within the same tissue). The correlation is much lower for
healthyvs.cancerousone(datanotshown).Figure1showsananalysisforallpairsofpatientswith
lungadenocarcinomasamples(521samples,abbreviatedLUAD)andmatchedhealthysamples(~50
samples).Thesimilaritybetweeneachpairisindicated(cosinevalues,seeMaterialsandMethods).We
showthathealthyandcanceroustissuesexhibitdifferentpatterns.Importantly,healthysamplesare
similartoeachanotherasarethediseasedsamples.
We tested whether this property (Figure 1) extends to all other cancerous and healthy tissues. We
thereforerestrictedouranalysistotissuesforwhichsamplesfrombothhealthyandcancerouspatients
areavailable.AsFigure2Ashows,eachhealthytissueiswellcharacterizedbyitsmiRNAs’profile.For
example,breasttissuesindifferentpatientshaveverysimilarmiRNAprofiles,anddiffersignificantly
from liver profiles. Even the two types of lung tissues are visibly distinguishable from each other.
However,thedistancematrixforthediseasedsamples(5classes,~2700patients,Figure2B)isrelatively
uniform.Whileallthediseasedtissuesarerathersimilarusingoursimilaritymatrix,asomewhathigher
similarityisvisiblebetweenthetwolungdiseases,theadenocarcinoma(LUAD)andlungsquamouscell
carcinoma(LUSC).WeconcludedthatthetrendthatweobservedinFigure1appliestoothertissues.
Namely,thedistanceamongallpairsofdiseasedtissuesislowerthanthatamongthecorresponding
healthytissues.
Figure2Coffersabird'seyeviewofmiRNA-basedinformationforall8522samplesthatwereincluded
inouranalysis,accordingto27classesdiscussedinthisstudy.Theentriesshowtheaverageofthe
8
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
correlations over all samples in each pair of classes. Note that the values on the diagonal are only
slightlyhigherthantherestofthevaluesinthissymmetricmatrix.Itisnoteworthythatthatbrainlower
grade glioma (LGG) markedly differs from other cancerous tissues, yet exhibiting a significant selfsimilarity.Additionally,sarcomasamples(SARC,diagonal,Figure2C)exhibitratherlow(0.61)average
correlation,suggestinghighvariabilityofthemiRNAexpressioninthiscancertype.
LowestexpressedmiRNAsprovidedifferentiatinginformation
InviewoftherelativelyhomogenouspatternofmiRNAprofilesamongdiseasedtissues(Figure1),itis
clearthattheidentificationofcancerclassescannotbetriviallybasedonapredefinedsimplesetof
miRNAs.Thus,wesoughtaclassifierthatcancorrectlylabelatissueoforiginbasedoninformation
thatisgloballycapturedbythemiRNAs’profiles.Tothisend,weappliedamulticlasssupportvector
machine (SVM) classifier with a randomly chosen training sets (Figure 3A). The flow of the
methodologies used in this study is shown. Our classifier employs two variations of data
transformation,followingbyastepthatbenefitsfromstudyingthetypicalerrorsinourclassification
task(Figure3A).
SVMappliedtotherawdatatoidentifythetissueoforiginachievesanaveragesuccessrateof58%,
andamedianof69%(27classes)(Figure3B,blue).Notethatnaïveguessingwouldyield~4%success
rateforclassifyingasampletoanyofthe27classes.Interestingly,for10outofthe27classesattains
recall rate higher than 80%, and for 5 classes the success rate is even above 90%. However, the
predictionforsomeclassesisverypoor(<10%).ThebestperformanceisassociatedwithLGG,ascould
beanticipatedfromtheobservationinFigure2C.
However,forclinicalpurposes,asuccessrateof58%isunsatisfactory.Wethusfocusedonsearchinga
subsetofmiRNAsthataremostinformativetowardsthetaskofclassifyingtissueoforiginandsample
status(diseases,healthy).TheexpressionlevelsofmiRNAexhibitahighdynamicrangeof5-6orderof
magnitudesandexpressionisexpressedinRPM(readspermillions).Furthermore,itturnsoutthat
onlyasmallnumberofmiRNAs(20-25)accountforthevastmajorityofthereadsinthesamples(9698%, not shown). Based on these observation, we tested the potential of discretizing miRNAs’ raw
9
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
values,thusallowingarepresentationthatemphasizesthecontributionoflowlyexpressedmiRNAs.
WeanticipatedthatthelongtailofmiRNAsmightbemoreinformativefortheclassificationpurposes
withrespecttoasmallsetofdominatingmiRNAs.
WeappliedasimpledichotomythatpartitionsthelowlyexpressedmiRNA(<30RPM)vs.therestof
themiRNAs.Typically,~70%ofthemiRNAsareexpressedattherangeof>0to30RPM.Thissimple
transformation (see Materials and Methods) dramatically reduces the volume of data, and
concentratesononly~0.02%oftheoriginalsignal(i.e.,thesumtotalofmiRNAsexpressionvalues).
Importantly,theapplicationofnaïvedichotomyalreadyimprovestheperformanceofourclassifierto
anaveragesuccessrateof66%(medianof80%,Figure3B).
Wesoughtadditionaldataprocessingprocedurestofurtherimprovetheclassificationperformance.
IndichotomywetreatequallymiRNAsthatareexpressedatlevelsabovethethreshold(>30RPM)and
thosethatarenotexpressedatallinaspecificclass.Wenextseparatethesetwosets,andreplacethe
datatransformationtoatrichotomy(Figure3A,seeMaterialsandMethods).Thistransformationyields
asubstantialimprovement.Theclassifierthusreachesanaveragesuccessrateof88%(median91%,
Figure3BandTable1).Figure3Cshowstheimprovementintheclassificationsuccessforthetrichotomy
mode with respect to the original raw data. It is clear that some classes benefit more from this
transformationthanothers.For8outofthe27classestheobservedimprovementwas>3folds(Figure
3C).
Weanalyzedthedependenceofthesuccessrateonthechosenthreshold(<30RPM)andobserved
that the recall rate remained virtually unchanged for thresholds ranging between 5 to 300 RPM.
However,raisingthethresholdabove300RPMaffectstheperformancenegatively,suggestingthatat
thisrangeofparameterweareactuallyaddingnoisyinformation(Figure4).Itisevidentthatforboth
thedichotomyandtrichotomymodes,informationwithintheverylowexpressionlevelsiscriticalfor
theclassificationtask.Thecontributionoftheverylowly-expressedmiRNAstotheclassificationtask
wasfurtherassessed.WezoomedonmiRNAsexpressedatlevelsx,wherexrangesfrom>0to30.We
trainedtheclassifieragainfordifferentvaluesofxandartificiallyrelabeledmiRNAsthataresmaller
10
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
than x as non-expressed. When marking as non-expressed miRNAs with expression levels under 10
RPM,therewasnosignificantchangetotheclassificationresults,whileapplyingx>10RPMledtoa
declineinperformance.Typically,only~8%ofthemiRNAsarelistedintherangeof10-30RPM,while
25%oftheentirelistofmiRNAsarewithintherangeof1to10RPM.Weconcludethatlowlyexpressed
miRNAsthatcarryfundamentalinformationfortheclassifier,mostly(butnotentirely)resideinthe10
to30RPMrange.
Notalltissuetypesareequallyhardtoclassify
Asmentioned,trichotomyhasresultedinasubstantialimprovementformanyclassifiedtissues.For11
outofthe27classes,weconsistentlyreachedasuccessrateof95%orabove(Table1).Importantly,
theclassificationsuccessdoesnotimmediatelyreflectthesamplesize.Forexample,a96%successfor
Adrenocortical Carcinoma (ACC) is based on only 80 samples while lung squamous cell carcinoma
(LUSC)achievedonly84%successwithasamplesizewhichis6-foldlarger(Table1).Specifically,weare
doingpoorlyonclassifyingesophagealcarcinoma(ESCA)andrectumadenocarcinoma(READ),witha
successrateof43%and39%,respectively.
Wetestedthesensitivityoftheclassificationperformancewithrespecttoalternativemachinelearning
methods:RandomForestandSVMwithpolynomialandradialbasisfunction(RBF)kernels.Theresults
forRandomForestandSVMwithpolynomialkernelweresomewhatworsethantheusedSVM(see
Materials and Methods). RBF kernel gave an improved performance on the raw data, but failed to
improvefollowingdatadiscretizationanddatatransformation.
Athirdoftheclassificationerrorsareinterpretable
Wetriedtorefineourunderstandingonfeaturesthatgovernthesuccess/failureintheclassification
task.Tothisendweinspectedtheresultsinviewofthedifferenterrorsthataremade.Figure5presents
a“wheelview”forall27classeswithavisualindicationonthesamples’sizeforeachclass,andthe
typical errors. Cross-edges in the graph and their color indicate the extent and nature of all
misclassifications.Thewheelindicatesthetrue-positiveandfalse-positive(innerarc),aswellasthe
11
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
true-positivesandfalse-negatives(middlearc).Fromtheoutermostarconecanestimatethesumof
thetwoerrortypes.ThesourcedataisavailableinSupplementalTableS2.
Weareoftenunabletointerprettheerrorsinclassification.Wecategorizedtheerrorsbybiological
relevanceoftheclassificationerrors:(a)Cancerousstateerrors:Correctclassifiedtissuebutfailurein
determining healthy vs. diseased state. These are restricted to the 3 instances of tissues that are
representedbybothhealthyandcancerousinstances.(b)Intra-tissueerrors:misclassificationofone
classtoanotherthatisanatomicallyadjacent.Sucherrorsarelimitedtoclasseswithmultiplediseases
of the same/ related organ. (c) Inter-class errors: classification mistakes that are not obviously
interpretable. Figure 6A shows schematically the possible error types for all 27 classes. Figure 6B
indicatesthedistributionofdifferenterrortypesamongtherelevantcategoriesoferrors.Figure6C
focusesonanexampleforinter-classerrorandthesourceforfalsepositives.Itshowsthatinthecase
ofBLCA,themajorityofthefalsepositives(10.6%outof12.7%)derivefrommisclassificationfrom6
classesofcancercarcinomas(additionalclasses,thatcontribute<1%offalsepositiveseacharenot
shown).
Improvingclassificationsuccessfromerrors’consistency
AsFigures5-6demonstrate,manyerrorsarerecurrentamongspecificclasses.Wetestedtheeffectof
mergingclassesthatarecharacterizedbyabundant,bi-directionalfalseassignmentsbytheclassifier.
The most prominent case involves Adenocarcinoma of the Rectum and Colon (READ and COAD,
respectively). Merging these two classes has improved the average recall rate, while merging other
pairsofclasseshasworsenedtheoverallclassificationsuccess(notshown).
Whenweexaminedtheerrorspersampleittranspiredthatforcertainsamples,classificationvaried
withthe(random)choiceoftrainingsets.Apluralityvoteprotocolwasappliedinordertoreducethis
dependency(seeMaterialsandMethods).Thepluralityvoteledtoanadditionalimprovementinthe
averageclassificationsuccess.Formostclassestheimprovementisrathermodest.However,merging
theRectum(READ)andColonAdenocarcinoma(COAD)classesthatwerereportedwith39%and80%
12
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
success, respectively was highly beneficial. Actually the merged collection of samples reached 97%
recall.Theaveragerecallrateforthevalid26classeshasreachedafurtherimprovementto91%,with
amedianof95%(Table1).
Additionaltestsinwhichtheclassifierwastrainedtoidentifytheclinicalstagesofthedisease(i.e.,
cancer stage 1 versus stage 3) and other global features, like age and gender, resulted in poor
performanceandwillnotbefurtherdiscussed.
LowestexpressingmiRNAunveilinformativemiRNAs
Notwithstanding the impressive success rate of the classifier (91%, with 95% median), we are still
unabletoquantifythecontributionofanyspecificmiRNAtothissuccess.Tothisend,wedesigneda
ranking score for miRNAs according to their potential in contributing to the classifier success (see
Materials and Methods). Empirically, we compared the results achieved when classifying with a
reducedvectorofmiRNAexpression.Werandomlydrewvariousgroupsof50miRNAsoutofthe1046
miRNAdiscussedinthisstudy(TableS1)andtestedtheclassificationresultswhenusingthedataasis
(rawdata),aswellasfollowingthetrichotomytransformation(Figure7A).Wealsotestedtheresults
whenselectingthe50mostinformativemiRNA(accordingtorankbysumofthestatisticalvariation,
seeMaterialsandMethods).WerepeatedthisrandomizationtestforagrowingsetofmiRNAs.Note
that as the set size grows, a random set tends to include more informative miRNAs (e.g., for 200
miRNAs).Consequently,thedifferencesbetweentheperformanceoftherandomizedsetandthemost
informativesetisreduced.Alreadythe50mostinformativemiRNAsattainanimpressive75%success
rate,suggestingthatmuchoftheclassificationsignalisalreadycapturedbythesetthatischaracterized
bymaximalvariance(Figure7A).With300ofthemostinformativemiRNAs,successratereachedon
average87%,similartowhatisachievedwhenallthemiRNAsareemployedalongwiththetrichotomy
transformation.
WetestedwhetherthemiRNAsthatcontributemosttotheclassificationtaskareindeedthosethat
wereimplicatedincancerprogression,prognosis,anddiagnosis.Tothisend,weanalyzedthe20most
13
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
informativemiRNAsbythecriterionusedinFigure7A.Thesourceofvariabilityisshownasreflected
bythes.d.associatedwiththe3coordinatesforeachclass(noexpression,lowlyexpressed(0<x<30)
andhighlyexpressed(>30RPM).ForacompleterankofinformativemiRNAsseeSupplementalTable
S3.ItisclearthatsomeinformativemiRNAsrelyonvariabilityinthevectorofhighlyexpressed(Figure
7B,comparemiR-205tomiR934).Forothers,novariabilityisobservedinthevectorfornon-expressed
(Figure7B,blue).Whilebothhsa-mir-141andhsa-mir-215areamongthemostinformativemiRNAs,
their source of variability is rather different. While hsa-miR-205 is characterized by a very large
variationtointhehighlyexpressedsection,yetasubstantialvariabilityisrecordedforthismiRNAalso
intheothertwocategories(non-expressedandlowlyexpression).
Discussion
CellularquantitiesofmiRNAs
OurresultsuncoverthepoweroftheleastexpressedmiRNAsindistinguishingalargecollectionof
tissuesandcanceroustypes.ThecommonnotionisthatthequantityofmaturemiRNAsinthecellis
criticalforcellregulation.Specifically,excessofaspecifictypeofmiRNApotentiallytitratesoutgenuine
mRNAtargets(37).Asasecondaryeffect,anoverflowofspecificmiRNAwillmostlikelyresultinits
bindingtonon-genuinetargets,theso-calledcalled“offtarget”effect(38).Theoverallquantitative
effectofmiRNAlevelsisdiscussedintermsofcompetition(39,40)andcooperativity(41).Moststudies
thatconsidermiRNAsinlivingcellstendtobebiasedtowardtheanalysisofhighlyexpressedmiRNAs.
Withinsuchframework,lowlyexpressedmiRNAsaremostlikelydiscardedwiththepremisethatthey
havenoimpactofthemasseffect.
In this study, we show for the first time that for the task of cancer classification, miRNA-based
informationisencodedbytheminuteexpressionofmiRNAsthattogetheraccountforonly0.02%of
thetotalreads.Moreover,itisapparentlythelongdistributionaltailofmiRNAexpressionthatcarries
thesignalforcorrectlyidentifyingmultiplecancer/tissuetypes.Weactuallyshowthatincludingthe
14
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
entire profile of miRNAs leads to poorer performance as the principal, informative signal for class
identificationismasked.
A practical consequence of our observation concerns the preferred sequencing depth needed for a
routine miRNA profiling. The high coverage that is reported for all samples in TCGA, entails a high
sensitivity.Specifically,about45-50%ofthemiRNAs,inalmostallclassesareexpressedatalevelof0
to1RPM.ThemiRNAlistwithvaluesof1-10RPMaccountsforanadditional25%ofthemiRNAlist.
Classically,whensequencingshortRNA(<200nt),mostprotocolscallforacoveragerangingfrom0.5
to1.5Mreads,undertheassumptionthatdiscoveryratefornewmiRNAsisextremelylowabove2M
reads(42).WearguethatthehighcoverageprovidedbytheTCGAisessentialfor(i)increasedreliability
ofmiRNAidentification(i.e.,reportingonexpressionfromalistof1046miRNAs);(ii)highlightingnonclassicalandoftenlowexpressingmiRNAs.
LowexpressingmiRNAactasintrinsicmarkersfortissuespecificity
Anunexpectedoutcomeofourstudyreferstothestabilityandconsistencyintheexpressionofthe
lowlyexpressedmiRNAsthroughmanysamplesofthesamecancerclass.Ampleobservationsshow
that extreme conditions (e.g., stress, transformation, viral infection, differentiation) lead to drastic
changes in miRNA expression profiles in cell types. We postulate that in conditions where miRNA
expressionpatternsdramaticallychange(e.g.,cancer,stemcelldifferentiation,hypoxiaetc.),thetailof
the miRNA distribution is hardly affected and consequently cell identity and the origin of the cells
remainrobust.Thisinterpretationissupportedbythegradualmanipulatingofthedatabyremovalof
miRNAsthatbelongtothelowlyexpressedtailandassessingtherobustnessoftheclassifiertosuch
manipulations(Figure4).
miRNAprofilecanbeusedforsub-classificationofcancertypes
TCGAisafastgrowingresourceandhasalreadyreached>10,000sampleswhichareassignedwith33
cancertypes.Themethodologypresentedinthisstudyisapplicableforaclassificationtaskforany
numberofdiseases’classesthataresupportedbyenoughinstances.Weshowedthattheclassification
15
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
performanceisveryhighforasmallsetofclasses(Figure3BandTable1).Forexample,thesuccess
rateforBreastInvasiveCarcinoma(BRCA)is96-98%.However,awidespectrumofsurvivalrates(43,44)
is associated with breast cancers, supporting the notion of heterogeneous molecular basis of this
disease. Therefore, managing the disease is likely to benefit from a refined classification within the
broadclassofBRCA.AprognosisprofilecombiningmRNAs,miRNAsandDNAmethylationhadbeen
proposed(45).Inadditiontothegenesthatwereimplicatedasdrivermutations,theauthorsidentified
number of informative miRNAs such as hsa-miR-328, hsa-miR-484, and hsa-miR-874. While these
miRNAswereassociatedwithcellproliferation,angiogenesisandtumor-suppressivefunctions(46,47),
they are poor separators of BRCA from other major tumor types(48). Actually, these miRNAs were
implicatedinColonAdenocarcinoma(COAD)(49)andmostlyinrenalandlungcarcinomas(48).
Importantly,eachcancertypedisplaysitsuniquemiRNAset.Forexample,overexpressedmiRNAsthat
wereimplicatedingastriccancerincludemiR-17-5p,miR-18a,miR-18b,miR-19a,miR-20b,miR-20a,
miR-21,miR-106a,miR-106b,miR-135b,miR-183,miR-340-3p,miR-421andmiR-658.Incoloncancer,
inadditiontothevalidatedoncomiRs,miR-16,miR-31,miR-34a,miR-96,miR-125bandmiR-133bwere
overexpressed. The list of cancer associated miRNAs for pancreatic ductal adenocarcinoma (PDAC)
includes miR-15b, miR-95, miR-155, miR-186, miR-190, miR-196a, miR-200b, miR-221 and miR-222
(reviewed(50)).AmongtheunderrepresentedmiRNAs(calledanti-oncomiRs)let-7a-1,miR-143and
miR-145werevalidatedforanumberofcancertypes.
We found no statistical evidence for an overlap between the ‘most informative’ miRNAs (e.g.,
supplementalTableS3,Figure7B)andmiRNAsthatarereportedintheliteraturetogoverncancer
progression.ThelaterincludesanextendedlistofoncomiRsincludingmiR-15,miR-16,miR-17,miR18,miR-19a,miR-19b,miR-20,miR-21,miR-92,miR-125b,miR-155andmiR-569.Wepostulatethat
thereisnodirectcorrespondencebetweentheinformativetailoflowlyexpressingmiRNAsandthe
dominatingmiRNAsthatbestassociatedwithatransitionofacelltoitstransformed,cancerousstate.
16
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
Amongthetop20mostinformativemiRNAs,miR-205andmiR-141exhibitmaximalvariabilityforthe
27classes(byavectorof81values,seeMaterialsandMethods).AdditioninformativemiRNAs(Figure
7B) include miR-7-3, miR-31, miR-135a/135b, miR-149, miR-196a-1, miR-200a/200b/200c, miR-215,
miR-224, miR-328, miR-429, miR-552, miR-934, miR-944 and miR-1251. In view of the lack of any
statistical enrichment between miRNAs assigned as the most significant ones (Figure 7B) and the
collectionofoncomiRs/anti-oncomiRsfromtheliterature,wepostulatethatthemulticlasstypingand
markersfortumorigenesisrepresentingdifferentaspectsinthecharacterizationofthedisease.
Inspecting the top 50 miRNAs from our results addresses (indirectly) a question on the nature of
miRNAsthatcapturemostoftheclassificationinformation.AmongthemostvariablemiRNAs,three
ofthemiRNAs(hsa-miR-141,hsa-mir-200aandhsa-mir-200b)belongtoonefamily(mir-8)withahigh
correlationintheirappearanceinall27classes(r2=0.71-0.74).
Clinicalapplicationandpremises
Ourresultscanbeusedinclinicaltreatmentofpatientshavingcancerofunknownprimaryorigin(CUP).
However, toward improvements in clinical guidelines, additional measurements (e.g., immunehistochemical,PCRformRNAexpression)arelikelytobebeneficial.AsetofmiRNAsthatbestclassify
cancersamplesbytissueoforiginwaspresentedinviewofthesuccessindiagnosisofCUP(20,25,26).
A significant overlap between the most informative miRNAs (extracted from (26)) and our SVM
protocolwasfound.Bycomparinginformative50candidatesfromthetwostudies,wedetected10
shared miRNAs (miR-122, miR-138-1, miR-141, miR-146a, miR-196a, miR-200a, miR-200c, miR-205,
miR-31andmiR-9-3).Despitealargedifferenceintheanalyzeddata(500versus8000samples)and
themethodologyused(decisiontreevsSVM),theconsistencyamonginformativemiRNAsisintriguing
(p-value2E^5).Ampleevidencedemonstratedtheimpactofalterationinhsa-miR-200/hsa-miR-141
expressiononproliferation,morphologyandaberranthistoneacetylationinnumerouscancersandcell
types.AnoverviewonthecontributionofmiRNAsforcancerdiagnosis,theadvantagesandlimitations
ofseveralcomplementarymethodswasrecentlyreported(51).
17
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
Lastly,weinspectedthefeaturesthatspecifytheleastinformativemiRNAsinourstudy.Thereare90
miRNAs (out of 1046) that are completely non informative to our task. Namely, in the trichotomy
transformationtheyareuniformlyexpressedwithrespecttothe3*27classes.ForthesemiRNAsthe
sumofthestandarddeviations(TableS3)foreachofthetrichotomylabeliszero.Obviously,these
miRNAs cannot contribute to the classification task. Interestingly, several validated omcomiRs are
among these miRNAs (e.g., miR-16, miR-17, miR-18 and miR-21). This emphasizes the uncoupling
between the informative low expressing miRNAs we have identified for the classification task and
highlyexpressedoncomiRs.ThelevelofexpressionofoncomiRsareusedasvaluablemarkersforcell
dysregulationandassucharethehallmarkofproliferationandtransformationincancer.
Funding
SupportedbyERCgrant339096onHighDimensionalCombinatoricsandELIXIR-Excelerategrant.
References
1.
Iorio,M.V.andCroce,C.M.(2012)MicroRNAdysregulationincancer:diagnostics,monitoring
andtherapeutics.Acomprehensivereview.EMBOmolecularmedicine,4,143-159.
2.
Baffa,R.,Fassan,M.,Volinia,S.,O'Hara,B.,Liu,C.G.,Palazzo,J.P.,Gardiman,M.,Rugge,M.,
Gomella,L.G.,Croce,C.M.etal.(2009)MicroRNAexpressionprofilingofhumanmetastatic
cancersidentifiescancergenetargets.TheJournalofpathology,219,214-221.
3.
Dumortier,O.,Hinault,C.andVanObberghen,E.(2013)MicroRNAsandmetabolism
crosstalkinenergyhomeostasis.Cellmetabolism,18,312-324.
4.
Calin,G.A.andCroce,C.M.(2006)MicroRNAsignaturesinhumancancers.Naturereviews.
Cancer,6,857-866.
5.
Volinia,S.,Calin,G.A.,Liu,C.G.,Ambs,S.,Cimmino,A.,Petrocca,F.,Visone,R.,Iorio,M.,
Roldo,C.,Ferracin,M.etal.(2006)AmicroRNAexpressionsignatureofhumansolidtumors
definescancergenetargets.ProceedingsoftheNationalAcademyofSciencesoftheUnited
StatesofAmerica,103,2257-2261.
6.
Wang,D.,Gu,J.,Wang,T.andDing,Z.(2014)OncomiRDB:adatabasefortheexperimentally
verifiedoncogenicandtumor-suppressivemicroRNAs.Bioinformatics,30,2237-2238.
7.
Jansson,M.D.andLund,A.H.(2012)MicroRNAandcancer.Molecularoncology,6,590-610.
8.
Lujambio,A.andLowe,S.W.(2012)Themicrocosmosofcancer.Nature,482,347-355.
18
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
9.
Kitade,Y.andAkao,Y.(2010)MicroRNAsandtheirtherapeuticpotentialforhumandiseases:
microRNAs,miR-143and-145,functionasanti-oncomirsandtheapplicationofchemically
modifiedmiR-143asananti-cancerdrug.Journalofpharmacologicalsciences,114,276-280.
10.
Hayes,J.,Peruzzi,P.P.andLawler,S.(2014)MicroRNAsincancer:biomarkers,functionsand
therapy.Trendsinmolecularmedicine,20,460-469.
11.
Lu,J.,Getz,G.,Miska,E.A.,Alvarez-Saavedra,E.,Lamb,J.,Peck,D.,Sweet-Cordero,A.,Ebert,
B.L.,Mak,R.H.,Ferrando,A.A.etal.(2005)MicroRNAexpressionprofilesclassifyhuman
cancers.Nature,435,834-838.
12.
DiLeva,G.andCroce,C.M.(2013)miRNAprofilingofcancer.Currentopinioningenetics&
development,23,3-11.
13.
Nishida,N.,Yamashita,S.,Mimori,K.,Sudo,T.,Tanaka,F.,Shibata,K.,Yamamoto,H.,Ishii,
H.,Doki,Y.andMori,M.(2012)MicroRNA-10bisaprognosticindicatorincolorectalcancer
andconfersresistancetothechemotherapeuticagent5-fluorouracilincolorectalcancer
cells.Annalsofsurgicaloncology,19,3065-3071.
14.
Bockhorn,J.,Dalton,R.,Nwachukwu,C.,Huang,S.,Prat,A.,Yee,K.,Chang,Y.F.,Huo,D.,
Wen,Y.,Swanson,K.E.etal.(2013)MicroRNA-30cinhibitshumanbreasttumour
chemotherapyresistancebyregulatingTWF1andIL-11.Naturecommunications,4,1393.
15.
Bryant,R.J.,Pawlowski,T.,Catto,J.W.,Marsden,G.,Vessella,R.L.,Rhees,B.,Kuslich,C.,
Visakorpi,T.andHamdy,F.C.(2012)ChangesincirculatingmicroRNAlevelsassociatedwith
prostatecancer.Britishjournalofcancer,106,768-774.
16.
Pentheroudakis,G.,Briasoulis,E.andPavlidis,N.(2007)Cancerofunknownprimarysite:
missingprimaryormissingbiology?Theoncologist,12,418-425.
17.
Pavlidis,N.andPentheroudakis,G.(2010)Cancerofunknownprimarysite:20questionsto
beanswered.Annalsofoncology:officialjournaloftheEuropeanSocietyforMedical
Oncology/ESMO,21Suppl7,vii303-307.
18.
Fizazi,K.,Greco,F.A.,Pavlidis,N.,Daugaard,G.,Oien,K.,Pentheroudakis,G.andCommittee,
E.G.(2015)Cancersofunknownprimarysite:ESMOClinicalPracticeGuidelinesfordiagnosis,
treatmentandfollow-up.Annalsofoncology:officialjournaloftheEuropeanSocietyfor
MedicalOncology/ESMO,26Suppl5,v133-138.
19.
Greco,F.A.,Oien,K.,Erlander,M.,Osborne,R.,Varadhachary,G.,Bridgewater,J.,Cohen,D.
andWasan,H.(2012)Cancerofunknownprimary:progressinthesearchforimprovedand
rapiddiagnosisleadingtowardsuperiorpatientoutcomes.Annalsofoncology:official
journaloftheEuropeanSocietyforMedicalOncology/ESMO,23,298-304.
20.
Rosenwald,S.,Gilad,S.,Benjamin,S.,Lebanony,D.,Dromi,N.,Faerman,A.,Benjamin,H.,
Tamir,R.,Ezagouri,M.,Goren,E.etal.(2010)ValidationofamicroRNA-basedqRT-PCRtest
foraccurateidentificationoftumortissueorigin.Modernpathology:anofficialjournalofthe
UnitedStatesandCanadianAcademyofPathology,Inc,23,814-823.
21.
Varadhachary,G.R.,Spector,Y.,Abbruzzese,J.L.,Rosenwald,S.,Wang,H.,Aharonov,R.,
Carlson,H.R.,Cohen,D.,Karanth,S.,Macinskas,J.etal.(2011)Prospectivegenesignature
studyusingmicroRNAtoidentifythetissueoforigininpatientswithcarcinomaofunknown
primary.Clinicalcancerresearch:anofficialjournaloftheAmericanAssociationforCancer
Research,17,4063-4070.
22.
Bishop,J.A.,Benjamin,H.,Cholakh,H.,Chajut,A.,Clark,D.P.andWestra,W.H.(2010)
Accurateclassificationofnon-smallcelllungcarcinomausinganovelmicroRNA-based
approach.Clinicalcancerresearch:anofficialjournaloftheAmericanAssociationforCancer
Research,16,610-619.
19
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
23.
Gilad,S.,Lithwick-Yanai,G.,Barshack,I.,Benjamin,S.,Krivitsky,I.,Edmonston,T.B.,Bibbo,
M.,Thurm,C.,Horowitz,L.,Huang,Y.etal.(2012)Classificationofthefourmaintypesof
lungcancerusingamicroRNA-baseddiagnosticassay.TheJournalofmoleculardiagnostics:
JMD,14,510-517.
24.
Ferracin,M.,Pedriali,M.,Veronese,A.,Zagatti,B.,Gafa,R.,Magri,E.,Lunardi,M.,Munerato,
G.,Querzoli,G.,Maestri,I.etal.(2011)MicroRNAprofilingfortheidentificationofcancers
withunknownprimarytissue-of-origin.TheJournalofpathology,225,43-53.
25.
Meiri,E.,Mueller,W.C.,Rosenwald,S.,Zepeniuk,M.,Klinke,E.,Edmonston,T.B.,Werner,
M.,Lass,U.,Barshack,I.,Feinmesser,M.etal.(2012)Asecond-generationmicroRNA-based
assayfordiagnosingtumortissueorigin.Theoncologist,17,801-812.
26.
Rosenfeld,N.,Aharonov,R.,Meiri,E.,Rosenwald,S.,Spector,Y.,Zepeniuk,M.,Benjamin,H.,
Shabes,N.,Tabak,S.,Levy,A.etal.(2008)MicroRNAsaccuratelyidentifycancertissueorigin.
Naturebiotechnology,26,462-469.
27.
DeCecco,L.,Dugo,M.,Canevari,S.,Daidone,M.G.andCallari,M.(2013)Measuring
microRNAexpressionlevelsinoncology:fromsamplestodataanalysis.Criticalreviewsin
oncogenesis,18,273-287.
28.
Kosaka,N.,Iguchi,H.andOchiya,T.(2010)CirculatingmicroRNAinbodyfluid:anew
potentialbiomarkerforcancerdiagnosisandprognosis.Cancerscience,101,2087-2092.
29.
Fesler,A.,Jiang,J.,Zhai,H.andJu,J.(2014)CirculatingmicroRNAtestingfortheearly
diagnosisandfollow-upofcolorectalcancerpatients.Moleculardiagnosis&therapy,18,
303-308.
30.
Monzon,F.A.andKoen,T.J.(2010)Diagnosisofmetastaticneoplasms:molecularapproaches
foridentificationoftissueoforigin.Archivesofpathology&laboratorymedicine,134,216224.
31.
Gore,J.,Craven,K.E.,Wilson,J.L.,Cote,G.A.,Cheng,M.,Nguyen,H.V.,Cramer,H.M.,
Sherman,S.andKorc,M.(2015)TCGAdataandpatient-derivedorthotopicxenografts
highlightpancreaticcancer-associatedangiogenesis.Oncotarget,6,7504-7521.
32.
Zhang,W.(2014)TCGAdividesgastriccancerintofourmolecularsubtypes:implicationsfor
individualizedtherapeutics.Chinesejournalofcancer,33,469-470.
33.
Ho,D.W.,Kai,A.K.andNg,I.O.(2015)TCGAwhole-transcriptomesequencingdatareveals
significantlydysregulatedgenesandsignalingpathwaysinhepatocellularcarcinoma.
Frontiersofmedicine,9,322-330.
34.
Zhu,Y.,Qiu,P.andJi,Y.(2014)TCGA-assembler:open-sourcesoftwareforretrievingand
processingTCGAdata.Naturemethods,11,599-600.
35.
Chang,C.C.,Hsu,C.W.andLin,C.J.(2000)Theanalysisofdecompositionmethodsfor
supportvectormachines.IEEEtransactionsonneuralnetworks/apublicationoftheIEEE
NeuralNetworksCouncil,11,1003-1008.
36.
Abraham,A.,Pedregosa,F.,Eickenberg,M.,Gervais,P.,Mueller,A.,Kossaifi,J.,Gramfort,A.,
Thirion,B.andVaroquaux,G.(2014)Machinelearningforneuroimagingwithscikit-learn.
Frontiersinneuroinformatics,8,14.
37.
Bosson,A.D.,Zamudio,J.R.andSharp,P.A.(2014)EndogenousmiRNAandtarget
concentrationsdeterminesusceptibilitytopotentialceRNAcompetition.Molecularcell,56,
347-359.
20
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
38.
Jackson,A.L.,Burchard,J.,Schelter,J.,Chau,B.N.,Cleary,M.,Lim,L.andLinsley,P.S.(2006)
WidespreadsiRNA"off-target"transcriptsilencingmediatedbyseedregionsequence
complementarity.Rna,12,1179-1187.
39.
Tay,Y.,Rinn,J.andPandolfi,P.P.(2014)ThemultilayeredcomplexityofceRNAcrosstalkand
competition.Nature,505,344-352.
40.
Hansen,T.B.,Jensen,T.I.,Clausen,B.H.,Bramsen,J.B.,Finsen,B.,Damgaard,C.K.andKjems,
J.(2013)NaturalRNAcirclesfunctionasefficientmicroRNAsponges.Nature,495,384-388.
41.
Balaga,O.,Friedman,Y.andLinial,M.(2012)TowardacombinatorialnatureofmicroRNA
regulationinhumancells.Nucleicacidsresearch,40,9404-9416.
42.
Metpally,R.P.,Nasser,S.,Malenica,I.,Courtright,A.,Carlson,E.,Ghaffari,L.,Villa,S.,Tembe,
W.andVanKeuren-Jensen,K.(2013)ComparisonofAnalysisToolsformiRNAHigh
ThroughputSequencingUsingNerveCrushasaModel.Frontiersingenetics,4,20.
43.
CancerGenomeAtlas,N.(2012)Comprehensivemolecularportraitsofhumanbreast
tumours.Nature,490,61-70.
44.
Tahiri,A.,Leivonen,S.K.,Luders,T.,Steinfeld,I.,RagleAure,M.,Geisler,J.,Makela,R.,Nord,
S.,Riis,M.L.,Yakhini,Z.etal.(2014)Deregulationofcancer-relatedmiRNAsisacommon
eventinbothbenignandmalignanthumanbreasttumors.Carcinogenesis,35,76-85.
45.
Volinia,S.andCroce,C.M.(2013)PrognosticmicroRNA/mRNAsignaturefromtheintegrated
analysisofpatientswithinvasivebreastcancer.ProceedingsoftheNationalAcademyof
SciencesoftheUnitedStatesofAmerica,110,7413-7417.
46.
Dai,Y.,Sui,W.,Lan,H.,Yan,Q.,Huang,H.andHuang,Y.(2009)Comprehensiveanalysisof
microRNAexpressionpatternsinrenalbiopsiesoflupusnephritispatients.Rheumatology
international,29,749-754.
47.
Nohata,N.,Hanazawa,T.,Kikkawa,N.,Sakurai,D.,Fujimura,L.,Chiyomaru,T.,Kawakami,K.,
Yoshino,H.,Enokida,H.,Nakagawa,M.etal.(2011)TumoursuppressivemicroRNA-874
regulatesnovelcancernetworksinmaxillarysinussquamouscellcarcinoma.Britishjournal
ofcancer,105,833-841.
48.
Leidinger,P.,Keller,A.,Borries,A.,Huwer,H.,Rohling,M.,Huebers,J.,Lenhof,H.P.and
Meese,E.(2011)SpecificperipheralmiRNAprofilesfordistinguishinglungcancerfrom
COPD.Lungcancer,74,41-47.
49.
Li,H.G.,Zhao,L.H.,Bao,X.B.,Sun,P.C.andZhai,B.P.(2014)Meta-analysisofthe
differentiallyexpressedcolorectalcancer-relatedmicroRNAexpressionprofiles.European
reviewformedicalandpharmacologicalsciences,18,2048-2057.
50.
Paranjape,T.,Slack,F.J.andWeidhaas,J.B.(2009)MicroRNAs:toolsforcancerdiagnostics.
Gut,58,1546-1554.
51.
Pichler,M.andCalin,G.A.(2015)MicroRNAsincancer:fromdevelopmentalgenesinworms
totheirclinicalapplicationinpatients.Britishjournalofcancer,113,569-573.
21
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
Tables
Table 1. Average classification success for classification of the 27 classes by various data
transformationprotocols.
Class
#ofsamples
Rawdata
Dichotomy
Trichotomy
Pluralityvote
ACC
80
0.7
0.86
0.96
0.96
BLCA
275
0.3
0.57
0.8
0.77
BRCA
1066
0.86
0.96
0.97
0.98
Breast
102
0.7
0.23
0.91
0.92
CESC
258
0.23
0.33
0.75
0.78
COAD*
433
0.82
0.9
0.8
0.97
ESCA
127
0.03
0.02
0.43
0.52
HNSC
513
0.8
0.89
0.89
0.92
KICH
66
0.74
0.53
0.92
0.94
Kidney
71
0.19
0.55
0.99
1
KIRC
535
0.87
0.96
0.95
0.95
KIRP
258
0.77
0.74
0.91
0.91
LGG
518
0.99
1
1
0.99
LIHC
269
0.8
0.95
0.98
0.97
LUAD
499
0.78
0.83
0.9
0.94
LUSC
467
0.6
0.69
0.84
0.87
PAAD
96
0.12
0.46
0.91
0.89
PCPG
184
0.93
0.99
0.99
1
PRAD
421
0.97
0.99
0.99
1
READ*
159
0.03
0.02
0.39
-
SARC
196
0.32
0.8
0.97
0.99
SKCM
411
0.92
0.99
0.99
0.99
STAD
353
0.54
0.59
0.87
0.87
THCA
507
0.94
1
0.99
1
Thyroid
59
0.08
0
0.82
0.88
UCEC
542
0.7
0.93
0.94
0.95
UCS
57
0.08
0.15
0.79
0.8
AVERAGE
315.63
0.59
0.66
0.88
0.91
*COADandREADclassesweremergedforthepluralityvoteprotocol
22
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
FigureLegends
Figure1.Distancematrixoftherelationsbetweenlungadenocarcinomapatients’samplesandhealthy
lung samples. There are 50 healthy samples and 521 diseased samples. The distance is the cosine
between each pair of samples. The larger the angle, the larger is the distance between the paired
miRNAexpressionvectors.Bluetoredindicatingtherangefromsimilaritytomaximaldistance.
Figure2.Analysisoftherelationsbetweensamplesfromdifferenttissueorigin.(A)Distancematrixof
healthysamplesfrom5sources.Thehealthysamplesarefarmoresimilarwithinthetissuethanamong
tissues,substantialsimilarityisobservedbetweenhealthysamplesfromlungadenocarcinoma(LUAD)
andlungsquamouscellcarcinoma(LUSC)thataremarkedaslungandlung2,respectively.(B)Matched
diseasedsamplesfromthesametissuesreportedin(A).Thenumberofpatientsfromeachclassis
indicated (axis Y). Recall that for the machine learning approach we excluded classes that are
supported by <50 samples (see Materials and Methods). As oppose to the healthy samples, the
distancesbetweenthepatientsamplesareratheruniformwithnoclearpartitionbetweendiseased
classes.Bluetoredindicatingtherangefromsimilaritytomaximaldistance.(C)A'birdview'ofthe
miRNA-basedinformationfor>8500TCGAsamples.All27classesarelistedinanalphabeticalorder.
The classes that are associated with healthy samples (Breast, Kidney and Thyroid) are marked by
arrows.Notethatthecorrelationwithinthesameclass(thediagonal)isnotnecessarilymaximal(e.g.,
0.61forSarcoma,SARC).Thebrainlowergradeglioma(LGG)isanoutlierwithmaximaldistancetoany
otherclass.
Figure3.Classificationresultsfollowingdatatransformation.(A)Aschemeoftheworkflow.Eachknob
describedeitheradatatransformationorarefinementoftheusedmethod.(B)Averagesuccessrates
foreachclasswhileusingrawdata(blue),datainadichotomousrepresentation(emptydiamonds)
and data following a trichotomy representation (red). (C) Fold improvement achieved by applying
trichotomy transformation with respect to the results from raw data. The classes are sorted
alphabetically.
23
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
Figure4.Averagesuccessratewhenapplying200separationthresholdsfrom5to1000RPMwith5
RPM increments. The average success rate obtained for the dichotomous transformation and the
trichotomoustransformationsareonTopandBottom,respectively.Thedropinperformancesupports
the notion of increasing noisy data by increasing the threshold. Note a difference in the average
performancefordichotomyvs.trichotomytransformations.
Figure5.Acircularviewincluding27classesaccordingtotheerrortypesandthesizeofeachclass.
Each class is associated with two quantifiers and their summation. The inner arc displays the
distributionoftheclassesclassifiedasthesubjectedclass(i.e.,true-positiveandfalse-positive),the
middlearcdisplaysthedistributionofclassificationofthesampleswhichbelongtothesubjectedclass
(i.e.,true-positiveandfalse-negative).Theouterarcisthesummationoftheinnerandmiddlearcs.
Thenumberofedgesandtheircolor-codeddisplaytheextentandnatureofallmissedclassifications.
Notethetwo-sidederrorsfortheclassesofCOADandREAD.Eachclassiscolorcodedandtheerrors
arecoloredbytheoriginofthemisclassifiedclasses.SupplementarydataTableS2depictstheaverage
classificationresultsperclassfor100SVMruns.Table2isavisualizationofthedatapresentsinTable
S2.
Figure6.Analysisofclassificationerrorsfromsimilarsources.(A)Adiagramoftheaffiliatedtissues.
Healthy and cancerous tissues derived from the same tissues are connected with a green arrow.
Adjacentcanceroustissuesareconnectedwitharedarrowtoreflecttheanatomicalrelatedness.(B)
The rate of intra-tissue and cancerous state tissue out of the total errors for each of the relevant
classes.(C)Anexampleforthesourceofmisclassificationfocusingonthefalsepositives.Thefraction
offalsepositivesoccupies12.7%.Themainclassesthatcontributetothefalsepositiveareshownalong
theircontribution.Theindicated6classesshownaccountfor85%ofallfalsepositivesforthisclass
(BLCA).
Figure7.AveragesuccessrateachievedwhenlimitingthenumberofmiRNAs.(A)Theperformanceof
theclassificationschemefollowingreductionofthemiRNAsamplesizeto50,100and200isshown.
24
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
ThebluemarkersshowtheresultswithrandomlyselectedmiRNAsintheirrawformareused.Thered
markerswererunwithrandomlyselectedmiRNAinthetrichotomyformat.Theorangemarkerswere
runwiththetopinformativemiRNAinthetrichotomyform(seeMaterialsandMethods).Thebox-plot
presents75%ofthedata.Thes.d.foreachcolumnfortherawdatawere6.07%,7.59%and7.19%(for
50,100and200miRNAsaccordingly).Thes.d.foreachcolumnfortherandommiRNAinthetrichotomy
representationwere5.09%,4.15%and4.8%.Thes.d.foreachoftheinformativetrichotomyrunswere
0.01%.(B)Alistof20mostinformativemiRNAsaccordingtothes.d.ofthethreevectorsusedtodefine
the trichotomy transformation. The miRNAs are sorted according to the s.d. for the non-expressed
vector.ThefulllistofmiRNAsassociatedwiththes.d.foreachofthethreecoordinatesandrankedby
thevariabilitysumisavailableinSupplementalTableS3.
25
bioRxiv preprint first posted online Oct. 20, 2016; doi: http://dx.doi.org/10.1101/082081. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.
Supplementary data Table S1. The table include two types of information: (i) List of all miRNAs
includedintheTCGAdata;(ii)TCGAIDforeachofthesamplesthatwereincludedintheanalysis.
Supplementary data Table S2. The table depicts the average classification results per class for 100
SVMruns.NumericsourcedataforFigure5.
SupplementarydataTableS3.AlistofinformativemiRNAandtheirrespectiveranking.Datasource
forFigure7B.
26
A
B
BRCA
LIHC
LUAD
LUSC
UCS
C
A
A
RawData
RawData
B
Dichotomous
Representation
Dichotomous
Representation
Thrichotomous
Representation
Trichotomous
Representation
Mergingclasses
(erroranalysis)
Unification
of
READandSTAD
RecallSuccess Rate
1
0.8
0.6
0.4
0.2
0
C
rawdata Improvementratio
trichotomy
dichotomy
(trichotomy/rawdata)
C
Ratio(Trichotomy/rawdata)
18
16
14
12
10
8
6
4
2
0
Pluralityvote
MajorityVote
averagesuccessrate
Dichotomy
0.81
0.8
0.79
0.78
0.77
0.76
0.75
0.74
0.73
0.72
0.71
0
200
400
600
800
1000
800
1000
miRNAthreshold (RPM)
Trichotomy
averagesuccessrate
0.89
0.885
0.88
0.875
0.87
0.865
0.86
0.855
0.85
0.845
0.84
0
200
400
600
miRNAthreshold (RPM)
A
Thyroid
THCA
Breast
BRCA
LUAD
LUSC
UCS
UCEC
READ
COAD
STAD
Kidney
B
KIRC
KICH
KIRP
Cancerousstateerrors
Intra-tissueerrors
100
90
80
70
60
50
40
30
20
10
0
C
Miscellaneouserrors
Mapping FalsePositives(>1%)
BladderUrothelialCarcinoma(BLCA)
HeadandNeckSquamous
CellCarcinoma(HNSC)
87.3
1.1
1.1
Stomach
Adenocarcinoma (STAD)
3.6
2.4
1.1
LungSquamous
CellCarcinoma(LUSC)
Esophageal
Carcinoma(ESCA)
Kidney RenalPapillary
CellCarcinoma(KIRP)
1.5
CervicalSquamous CellCarcinoma&
Endocervical Adenocarcinoma (CESC)
A
B
0.5
0.45
0.4
StandardDeviation(s.d.)
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
s.d.[trichotomy, non-expressed]
s.d.[trichotomy,lowlyexpressed]
s.d.[trichotomy, highlyexpressed]]