Scientific Discovery in the Era of Big Data:
More than the Scientific Method
A RENCI WHITE PAPER
Vol. 3, No. 6, November 2015
Authors
Charles P. Schmitt, Director of Informatics and Chief Technical Officer
Steven Cox, Cyberinfrastructure Engagement Lead
Karamarie Fecho, Medical and Scientific Writer
Ray Idaszak, Director of Collaborative Environments
Howard Lander, Senior Research Software Developer
Arcot Rajasekar, Chief Domain Scientist for Data Grid Technologies
Sidharth Thakur, Senior Research Data Software Developer
Renaissance Computing Institute
University of North Carolina at Chapel Hill
Chapel Hill, NC, USA
919-445-9640
RENCI White Paper Series, Vol. 3, No. 6
AT A GLANCE
• Scientific discovery has long been guided by the scientific method, which is considered
  to be the “gold standard” in science.
• The era of “big data” is increasingly driving the adoption of approaches to scientific
  discovery that either do not conform to or radically differ from the scientific method.
  Examples include the exploratory analysis of unstructured data sets, data mining,
  computer modeling, interactive simulation and virtual reality, scientific workflows, and
  widespread digital dissemination and adjudication of findings through means that are
  not restricted to traditional scientific publication and presentation.
• While the scientific method remains an important approach to knowledge discovery in
  science, a holistic approach that encompasses new data-driven approaches is needed,
  and this will necessitate greater attention to the development of methods and
  infrastructure to integrate approaches.
• New approaches to knowledge discovery will bring new challenges, however, including
  the risk of data deluge, loss of historical information, propagation of “false” knowledge,
  reliance on automation and analysis over inquiry and inference, and outdated scientific
  training models.
• Nonetheless, the time is right for increased focus on the construction of Collaborative
  Knowledge Networks for Scientific Discovery designed to leverage existing data sources
  and integrate traditional and emerging scientific methods and thereby drive scientific
  discovery and application.
Introduction
Knowledge discovery in science refers to the systematic process whereby scientists draw logical
conclusions regarding the world around us, generate new theories based on those conclusions,
and share findings with other scientists and the lay public, thus enabling critical review and
consensus before new findings are added to the collective body of knowledge. Historically,
scientific discovery has been guided by the scientific method, which dates to ancient times and
involves both a philosophical and practical approach to science. Indeed, the renowned
philosopher Aristotle (384-322 BC) was one of the first to approach knowledge discovery
through rigorous, systematic observation, although it wasn’t until millennia later that the
scientific method was actually formalized and implemented, largely through the work of
Copernicus (1473-1543), Tycho Brahe (1546-1601), Johannes Kepler (1571-1630), Galileo Galilei
(1564-1642), Rene Descartes (1596-1650), and Isaac Newton (1643-1727) (Gower 1997; Betz
2011).
At first glance, the scientific method appears to be relatively simple and straightforward. In
sum, the method involves a repeating cycle of standardized steps: the process begins with
careful observation of the natural world, the framing of a question on the basis of one’s
observations, and a review of the existing body of knowledge to determine if a reasonable
explanation already exists. Assuming that the question remains open ended, a scientist will
then formulate a hypothesis, design and implement an experiment to test the hypothesis, and
analyze the experimental results. The analytical results are used to formally accept or reject the
hypothesis. The hypothesis may then be modified, with the experiment repeated or a new
experiment designed. Importantly, the results are then disseminated via presentation to peers
and publication, which allow for adjudication [1] or peer consensus regarding the validity of the
scientific findings. This, in short, is scientific discovery via the scientific method.
Revisiting Charles Darwin
As any schoolchild will attest, the work of Charles Darwin provides an exemplar of the scientific method and the
many challenges involved in scientific discovery (Burkhardt 1996; McKie 2008; Montgomery 2009). Darwin’s genius
perhaps lies in his keen ability to observe the natural world. One of his earliest observations was that similar
species are found across the globe and that individuals within a species are not identical but have local variation.
He questioned why multiple species exist instead of just one. Darwin continuously researched and reviewed the
existing scientific literature (i.e., the established scientific body of knowledge).
Darwin’s work was influenced by the research and writings of Thomas Malthus, who found that humans produce
more offspring than are needed to replace themselves and speculated that population size would soon exceed the
available resources required for survival. Darwin also observed that populations of plants and animals stay about
the same size because of limited resources and competition for those resources. Darwin’s thinking also was
influenced by the research and writings of Charles Lyell, who found that small, gradual geological processes can
produce large changes over time. Darwin made several brilliant inferences on the basis of his scientific
observations and the work of Malthus, Lyell, and others that led him to hypothesize that species change slowly in
a process of evolution from a common ancestor. He then spent decades in observational experimentation,
ongoing analysis of his scientific findings, and refinement of his hypothesis until he eventually reached the now
famous conclusion that the origin of the species lies in natural selection, or the process whereby individual
variation, coupled with competition among individuals for natural resources and social cooperation among kin to
increase individual fitness, determines differences among related species.
Coincidentally, while Darwin was refining his hypothesis and drawing a conclusion, Alfred R. Wallace, a much junior
researcher who was familiar with Darwin’s work, reached a similar conclusion, which he planned to publish and
disseminate to scientific peers. Aware that his work might go unrecognized [2], Darwin reached an agreement with
Wallace, brokered by Lyell and Joseph D. Hooker, and reluctantly published a joint scientific manuscript in 1858.
On the Origin of Species by Natural Selection wasn’t published until November 1859, but only as a book chapter,
not the full book that Darwin had intended. Interestingly, the scientific community reacted negatively to Darwin’s
work (e.g., Gray 1860); many of Darwin’s peers felt that he did not have sufficient evidence to put forth his
hypothesis, while others were disturbed by the theological, political, and social implications. The scientific debate
continued for nearly a century until the scientific community reached adjudication on Darwin’s hypothesis and
scientific findings and established the theory of evolution by natural selection. Of note, the lay debate continues
today, largely on theological and political grounds.
[1] “Adjudication” is a legal term that refers to decision making in the presence of a neutral third party who has the
authority to determine a binding resolution through some form of judgment or award. In science, adjudication
generally refers to decision making and consensus building among a group of widely regarded scientific experts on
the topic under discussion (Spangler 2003).
[2] “I never saw a more striking coincidence, if Wallace had my M.S. sketch written out in 1842 he could not have
made a better short abstract! Even his terms now stand as Heads of my Chapters… So all my originality, whatever it
may amount to, will be smashed. Though my Book, if it will ever have any value, will not be deteriorated; as all the
labour consists in the application of the theory” (Charles Darwin to Charles Lyell, June 18, 1858; in Charles
Darwin’s Letters: A Selection, F. Burkhardt, Ed., 1996).
The scientific method has certain desirable characteristics that have enabled it to withstand the
passage of time, even as new scientific tools and techniques have been introduced to the
scientific process. For example, the method is objective and removed from all personal and
cultural biases. All hypotheses are developed to be consistent with accepted scientific truths at
the time that the hypothesis is generated. All measurements must be observable and pertinent
to the hypothesis. The hypothesis and all conclusions must follow the principle of Occam’s
Razor in terms of parsimony, or simplicity with few assumptions. The hypothesis also must be
falsifiable (and capable of being disproven). Finally, the scientific findings must be reproducible
by other scientists.
Without dispute, the scientific method remains the most common and only validated approach
to scientific discovery and knowledge extraction. In fact, drug development in the U.S. relies
exclusively on the randomized placebo-controlled clinical trial, the exemplar of the scientific
method.
However, today’s advancements in digital computing and storage capabilities, coupled with
new methods for scientific communication, including social media, are introducing new
approaches to scientific discovery, each of which brings challenges and opportunities. In this
white paper, we consider how the advent of the digital age and today’s world of “big data” are
changing scientific discovery processes. We close with a framework for Collaborative
Knowledge Networks for Scientific Discovery designed to leverage existing data sources and
integrate traditional and new data-driven scientific methods, allowing for unprecedented
advances in scientific discovery and application.
Exploratory data sets and exploratory analysis
Today’s scientist has access to numerous large data sets of relevance to multiple scientific
domains. For example, the National Oceanic and Atmospheric Administration maintains several
databases containing data on climate patterns, earthquakes, ozone levels, and ocean
temperatures; these data are useful to scientists in many fields, including environmental
science, energy, public health, and medicine. Scientists are increasingly accessing these data
sets for exploratory analysis, which is an approach used to determine if a general hypothesis
bears any merit and/or if an experimental design is feasible. Exploratory analysis typically
begins with a general hypothesis or experimental design that isn’t well fleshed out and the
application of a variety of statistical approaches and visualization techniques to identify and
validate data elements and/or determine if a hypothesis is testable using the data (Behrens
1997; Gelman 2004; Diaconis 2011).
For example, consider an economics researcher who is interested in learning about the loaning
practices of a large credit union, in terms of the breakdown of mortgage loans across
socioeconomic sectors. S/he requests access to the credit union’s loan database in order to
determine how the data are structured and how fine-grained the available socioeconomic data
elements are. The researcher determines that the database contains data elements on the age,
sex, and gross income of loan holders, as well as the loan amount and the location of the
property, but it does not contain information on the ethnicity of the loan holders. The
researcher uses this information to modify his/her hypothesis for subsequent testing using the
same data.
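The credit union example can be made concrete with a short sketch. The records, field names, and helper functions below are all hypothetical and chosen only to mirror the example; the sketch profiles which data elements are present and then reports which elements a candidate hypothesis would need but the data set lacks.

```python
# Hypothetical loan records; field names and values are illustrative only.
loans = [
    {"age": 34, "sex": "F", "gross_income": 52000, "loan_amount": 180000, "property_location": "Durham"},
    {"age": 51, "sex": "M", "gross_income": 87000, "loan_amount": 240000, "property_location": "Raleigh"},
    {"age": 29, "sex": "F", "gross_income": 41000, "loan_amount": 150000, "property_location": "Durham"},
]


def profile(records):
    """Summarize each field: how many records populate it and how many distinct values it has."""
    summary = {}
    for field in sorted({name for rec in records for name in rec}):
        values = [rec[field] for rec in records if rec.get(field) is not None]
        summary[field] = {"present": len(values), "distinct": len(set(values))}
    return summary


def missing_fields(records, required):
    """Return the required data elements that never appear (non-null) in the records."""
    available = {name for rec in records for name, value in rec.items() if value is not None}
    return sorted(set(required) - available)


# A hypothesis about lending by ethnicity cannot be tested with these data,
# whereas one about lending by income can.
needs_ethnicity = missing_fields(loans, ["gross_income", "loan_amount", "ethnicity"])
needs_income = missing_fields(loans, ["gross_income", "loan_amount"])
```

An empty list from `missing_fields` suggests the hypothesis is at least structurally testable with the data at hand; it says nothing about data quality or collection bias.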
Exploratory analysis has always been part of scientific discovery and, historically, has been used
to generate the driving question underlying a scientific hypothesis. However, in the past, the
process was informal, slow, and observation-based (e.g., detailed notes on the types of foliage
identified at different altitudes within a given region); whereas today, it is fast, large-scale,
data-driven and often involves extensive use of advanced statistical methods and visualization
techniques. Furthermore, large data sets are being generated strictly for exploratory analysis,
and often such data sets incur a considerable expense with questionable cost-benefit. [3] A recent
McKinsey report (Manyika, et al. 2015), for example, estimates that less than 1% of data
captured and stored from an offshore oil rig equipped with 30,000 sensors is actually used to
guide operations. The value of the rest of the data remains to be determined.
In terms of benefits, exploratory analysis enables a scientist to quickly develop a more refined,
testable hypothesis and rigorous experimental design for subsequent hypothesis testing
(Behrens 1997; Gelman 2004; Diaconis 2011). Challenges include the costs involved with the
generation and long-term storage of exploratory data sets. In addition, drawing conclusions
from an analysis of exploratory data sets can introduce error if the data elements are not
described with sufficient metadata to fully understand data structure and meaning or if the
data elements have inherent biases due to how they were collected. Additional challenges
include issues of privacy and the inadvertent or intentional leakage of “sensitive” data,
particularly if a scientist does not obtain the requisite authorization to access a data set that
contains sensitive data or if sensitive data are accidentally provided to an unauthorized or
third-party user (Behrens 1997).
[3] The costs and benefits of these exploratory data sets are an open topic of discussion; for example, see
Wilhelmsen, et al. 2013 and Evans, et al. 2015 for discussions on the pros and cons of capturing and storing whole
genome versus targeted genome sequencing data.
Data mining
Data mining often begins without a hypothesis and involves the application of tools and
techniques from statistics, mathematics, and visualization to identify previously unknown
patterns and trends in a data set derived from one or more existing databases (Fayyad, et al.
1996; Holzinger, et al. 2014). While data mining is sometimes considered a form of exploratory
data analysis (Behrens 1997; Diaconis 2011), we argue for a distinction between exploratory
data analysis, which enables a scientist to explore a data set in terms of structure and types of
data elements, including the relevancy of external data sources (e.g., literature citations) to
data elements, and data mining, which enables a scientist to identify previously unknown
patterns in a data set in terms of relationships between data elements. We note that new
statistical approaches are being developed to overcome some of the limitations of both types
of scientific discovery. For example, the maximal information coefficient, or MIC, overcomes
the limitations of Pearson’s correlation coefficient, r, by allowing for the identification of
complex, nonlinear relationships (e.g., exponential, periodic) between data elements in data
sets of any size (Reshef, et al. 2011). [4] Suppose a microbiologist generates a gene expression
data set to identify genes involved in the regulation of the cell cycle. Using traditional statistical
approaches, the scientist is able to identify genes with strong deterministic or linear
associations with different aspects of the cell cycle (e.g., a gene that is required for chromatin
assembly). Using approaches such as the MIC, the scientist is able to identify genes with
periodic relationships with the cell cycle (e.g., a gene associated with an established cyclical
event or an event that occurs at a previously unknown frequency during the cell cycle).
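The contrast between a linear measure and an information-based measure can be illustrated with a small sketch. The code below is not the MIC algorithm of Reshef, et al.; as a stated simplification, it estimates mutual information (the concept underlying MIC) with fixed equal-width bins, and the bin count of 8 and the synthetic cosine data are assumptions made for illustration.

```python
import math
from collections import Counter


def pearson_r(xs, ys):
    """Pearson's correlation coefficient, which measures linear association only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def binned_mutual_information(xs, ys, bins=8):
    """Mutual information in bits, estimated after equal-width binning (a crude stand-in for MIC)."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # guard against constant input
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = to_bins(xs), to_bins(ys)
    n = len(xs)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / n) * math.log2(c * n / (px[a] * py[b])) for (a, b), c in joint.items())


# Synthetic periodic signal: two full cosine cycles, e.g. expression level versus time.
n = 1000
xs = [4 * math.pi * i / n for i in range(n)]
ys = [math.cos(x) for x in xs]

r = pearson_r(xs, ys)                   # close to zero: no linear trend
mi = binned_mutual_information(xs, ys)  # clearly positive: strong dependence
```

On this signal the linear coefficient is near zero even though `ys` is fully determined by `xs`, which is the kind of relationship the text says a linear analysis would miss. The binned estimate depends on the number of bins relative to the period, which is one reason MIC searches over many grid resolutions instead of fixing one.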
Data mining bears little resemblance to the scientific method. In particular, data mining is not:
(1) constrained to be objective; (2) consistent with accepted scientific truths; (3) parsimonious;
or (4) falsifiable. In addition, all measurements are observable only after the fact. Nonetheless,
data mining offers benefits, including the ability to relatively quickly discover previously
unknown relationships and thereby generate a new hypothesis that can then be tested using
the scientific method (Fayyad, et al. 1996; Holzinger, et al. 2014). Although often cited as a
criticism of data mining, a key benefit is the ability to generate multiple associations within a
single data set, which may yield powerful new information, especially when combined with
associations identified in other data sets. In this regard, data mining can be viewed as akin to
meta-analysis of clinical trial data.
Challenges include the fact that data mining requires extensive training beyond the skills of a
typical domain scientist; without such training, a scientist may identify patterns or associations
that are not valid or reproducible (Fayyad, et al. 1996; Behrens 1997; Diaconis 2011; Holzinger,
et al. 2014). Further, there is no commonly accepted approach or method for data mining,
which makes the field somewhat more of an art than a science (Fayyad, et al. 1996; Diaconis
2011). There is also a tendency to treat all findings as conclusive when they may be chance
findings (Fayyad, et al. 1996; Diaconis 2011). In addition, while the power of data mining
increases when multiple heterogeneous data sets are integrated before the data are mined, so
too does the complexity of the process (Fayyad, et al. 1996; Holzinger, et al. 2014). With
high-dimensional data, multiple approaches must be used to analyze the data; for example,
sophisticated visualization approaches may need to be combined with traditional statistical
approaches to data mining (Fayyad, et al. 1996; Diaconis 2011; Holzinger, et al. 2014).
[4] As an aside, we note that the theoretical foundation of the MIC lies in the concept of mutual information (MI) in
pairs of random variables, which was developed by Claude Shannon, the founder of information theory, more than
50 years ago (Speed 2011). The implementation of MI into practice as MIC cannot be achieved through manual or
semi-manual computation and instead requires significant digital computational power, something not available
until recently.
Computer modeling
Computer modeling involves conceptual, mathematical, computer-generated, or physical
representations of real-world objects or phenomena (Bowers 2012; Buytaert, et al. 2012;
Perra, et al. 2012; Berman, et al. 2015). It is used to test a hypothesis or observe and
manipulate an object or phenomenon that is otherwise difficult (or unethical) to observe and
manipulate.
As an example, consider a scientist who wishes to determine how the beta-amyloid protein
influences the development of Alzheimer’s disease. S/he develops a software program using
established principles of protein folding to visually explore in 3-D different scenarios whereby
beta-amyloid may fold in the brain to influence cognitive function and lead to the development
of Alzheimer’s disease.
Modeling represents a fundamental aspect of scientific discovery and has been used
throughout the history of modern science (e.g., anatomical models). The use of computer
modeling is relatively new, however, and as such, computer-driven modeling tends to be a
highly specialized tool as opposed to a general tool. Benefits of computer modeling include the
addition of a potentially powerful tool to the formerly manual process, particularly when
models incorporate the vast streams of data that are available from technologies such as
crystallography and other sophisticated sources of data (Perra, et al. 2012; Berman, et al. 2015).
Computer models become even more powerful when they are generated using data derived
from multidisciplinary science (Bowers, et al. 2012; Buytaert, et al. 2012).
Challenges, however, include the fact that computer-generated models are only as good as the
underlying data and software programs used to create them (Joppa, et al. 2013; Berman, et al.
2015). In addition, the risk of introducing error increases significantly when models are created
for poorly understood objects or phenomena, when the data are qualitative or otherwise
described in a non-standardized format, or when models developed in one field are applied
(without validation) to another field or even shared between different research groups within
the same field (Bowers, et al. 2012; Buytaert, et al. 2012; Perra, et al. 2012; Joppa, et al. 2013;
Berman, et al. 2015). Once introduced, errors tend to propagate and may even amplify
(Buytaert, et al. 2012). Moreover, many models are not designed to work in a dynamic,
flexible, user-driven, web environment (Buytaert, et al. 2012). Finally, modeling costs can be
quite high, sometimes higher than the costs of the instruments used to generate the data that
are used in the models (Berman, et al. 2015).
Interactive simulation and virtual reality
Interactive simulation and virtual reality are similar to modeling in that a computer is used to
create realistic scenarios for testing a hypothesis or manipulating an object or phenomenon
that is otherwise difficult (or unethical) to test or manipulate. However, the difference, as
defined herein, is that with interactive simulation and virtual reality, human behavior, not the
underlying software program, guides the outcome of the simulation or virtual experience (Zyda
2005; Kunkler 2006; Diemer, et al. 2015).
Suppose a medical device company has been developing a new anesthesia machine. In order to
obtain approval from the U.S. Food and Drug Administration, the device has to be tested for
safety in Phase I and II clinical trials. To achieve this, the company creates a “dummy” patient
that is programmed to mimic physiological responses during particular types of surgeries in a
simulated operating room. A Phase I clinical trial is then conducted in the simulated
environment, using a team of actual surgeons, anesthesiologists, and nurses as study subjects.
Interactive simulation and virtual reality can be used for hypothesis testing as part of the
scientific method, but they also can be used for exploratory analysis. The benefits
include the ability to investigate human behavior in the context of technology or
computer-assisted scenarios that resemble the real world (Zyda 2005; Kunkler 2006; Diemer, et al. 2015).
Thus, the approach facilitates both team-based science and science on human teams. Several
challenges exist, however. First, a team with expertise in both software development and
human behavior must create the software programs that drive the simulations and virtual
scenarios, with careful attention to every detail of the scenario (Diemer, et al. 2015). Also,
human behavior in a simulated environment might not be identical to human behavior in the
real world; at present, software programs cannot fully capture the true human experience,
including behaviors such as emotional reactions (Kunkler 2006). Finally, technological
challenges grow exponentially with the complexity of the scenario, and upfront costs are high
(Zyda 2005).
Scientific workflows
Scientific workflows refer to the abstract steps required to complete a specific scientific task.
Today, these are typically designed as specialized workflow management systems that are
capable of orchestrating and automating many of the steps, including data flow and, in some
cases, personnel (e.g., scheduling, access rights, alerts or reminders), required to complete a
specific task and/or test a hypothesis(es) (Curcin & Ghanem 2008; Perraud, et al. 2010;
Achilleos, et al. 2012; Bowers 2012; Guo 2013). The concept of the automated scientific
workflow appears to have originated with an undated publication by Singh and Vouk (see
references). Scientific workflows are used to automate tedious, time-consuming, or highly
complex steps in experimental testing and/or analysis.
For instance, imagine that a biomedical research company relies on flow cytometry, a
technique that enables the visualization and quantification of the protein composition of live
cells, as a critical part of its drug discovery efforts in diabetes. The company decides to
automate much of the process and hires a software engineering team to design a scientific
workflow system to automate and coordinate the technologies and research teams involved in
each step of the process, including isolation of blood cells from patients with diabetes,
incubation of cells with conjugated antibody(ies), multiple washes, immunofluorescence
visualization, and analysis to determine cellular subpopulations.
The design of most scientific workflow systems models the actual scientific method, except that
the measurements are automated and not always observable by the scientist. Moreover,
scientific workflows are increasingly being used for automated deduction and inference, which
are steps that historically relied on the scientist. In the example above, for instance, the use of
flow cytometry to identify cellular subpopulations on the basis of their fluorescent profile can
be more of an art than a science, especially when multiple fluorescent conjugates are used; as
such, automating the process can yield erroneous or inconsistent results. That said, scientific
workflows enable a scientist to conduct experiments more quickly, efficiently, and with greater
statistical power and reproducibility due to the large data volumes that a well-designed
scientific workflow can handle and the ability to integrate heterogeneous data sources, often in
real time (Perraud, et al. 2010; Achilleos, et al. 2012; Bowers 2012; Guo 2013).
While potentially quite powerful, the scientist in the realm of the workflow is often dependent
on the technology and removed from critical decision-making steps (e.g., calculations,
incubation times, etc.), which, in the absence of careful system design and implementation,
threatens the validity and reproducibility of workflow findings. Furthermore, unless the
workflow steps are clearly annotated and openly accessible, a peer reviewer will be unable to
fully evaluate and reproduce the scientific findings (Bowers 2012). Moreover, workflow design
can be quite complex, involving hundreds of individual analysis steps, large sets of
heterogeneous data, and multiple existing workflows that often were not designed to work
together; the complexity presents difficulties in user-friendliness, design, and capture and
record of provenance [5] (Perraud, et al. 2010; Achilleos, et al. 2012; Bowers 2012; Guo 2013).
There is also a practical limit to the degree of granularity in workflow tasks; highly granular
activities are typically not feasible due to a negative impact on overall system performance
(Perraud, et al. 2010). An additional concern is that scientific workflows can introduce
systematic bias in that errors (e.g., an incorrect calculation) may be introduced and propagated
indefinitely because automation makes it more difficult to detect and correct errors. Finally,
workflows tend to be very specialized and often cannot be realistically adapted for different
domains (Curcin & Ghanem 2008).
[5] “Provenance” refers to the capture of metadata that describes the origins of the data, each step of data
transformation and analysis, and a record of versioning to identify how up-to-date the data are (Guo 2013).
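As a minimal sketch of what capturing provenance in an automated workflow might look like, the code below runs a list of steps in order and records, for each step, its name, version, and fingerprints of its input and output. The step functions, version strings, source label, and record layout are all hypothetical; production workflow systems record far more detail.

```python
import hashlib
import json


def fingerprint(value):
    """Short, stable hash of a JSON-serializable value, used to identify a data state."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_workflow(data, steps, source):
    """Apply each (name, version, function) step in order and log a provenance record per step."""
    provenance = [{"step": "source", "origin": source, "output": fingerprint(data)}]
    for name, version, func in steps:
        before = fingerprint(data)
        data = func(data)
        provenance.append(
            {"step": name, "version": version, "input": before, "output": fingerprint(data)}
        )
    return data, provenance


# Hypothetical analysis steps.
def drop_missing(values):
    return [v for v in values if v is not None]


def normalize(values):
    top = max(values)
    return [v / top for v in values]


steps = [("drop_missing", "1.0", drop_missing), ("normalize", "1.0", normalize)]
result, provenance = run_workflow([2, None, 4, 8], steps, source="instrument export (hypothetical)")
```

Because each record's input fingerprint equals the previous record's output fingerprint, a reviewer can check that no undocumented transformation occurred between steps, which speaks to the annotation and reproducibility concerns raised above.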
Dissemination and adjudication
Dissemination and adjudication of scientific discoveries are critical components of scientific
discovery, as these are the processes through which new scientific “truths” are added to the
collective scientific body of knowledge (Smith 2006; ACS publications 2013; Almeida 2013; Cold
Spring Harbor Laboratory 2014). Dissemination has historically been achieved through
presentation to scientific peers and publication in scientific journals in order to obtain critical
peer review and validation (or rejection) of scientific findings and inform the scientific and lay
communities. Adjudication is the method by which a consensus is reached among scientific
experts regarding the validity of the scientific findings (also see footnote 1).
As an example of the dissemination and adjudication processes, assume a scientist delivers a
PowerPoint presentation on new technologies for hydrology at a scientific conference on water
safety. S/he receives immediate feedback on the presentation from colleagues in attendance at
the lecture. On the basis of that feedback, the scientist decides that an additional experiment is
required before his/her work is ready for publication in a peer-reviewed journal. To provide
another example, imagine a scientist intends to publish his/her new ceramics model in a
scientific journal focused on materials science. The journal requires critical peer review. [6] The
scientist submits the manuscript to the journal for potential publication, and it is reviewed by
anonymous peers. On the basis of the peer review, the editor decides that before the
manuscript can be published, the text needs to be revised to better frame the need for a new
ceramics model in the context of existing, similar models.
Historically, dissemination and adjudication have been key components of the scientific method
and the final screen before new scientific truths are added to the scientific body of knowledge.
The benefits are tremendous because the process enables scientific findings to be rigorously
evaluated by expert scientists, thus ensuring the robustness and integrity of scientific findings
(Smith 2006; ACS publications 2013; Almeida 2013; Cold Spring Harbor Laboratory 2014). The
challenges, however, are growing in the era of big data. Specifically, the digital world is
changing the way that scientific findings are generated, presented, published, and catalogued.
For example, traditional, paper-based, peer-reviewed scientific journals are competing with
new, electronic, open access scientific journals [7] that may not require a slow, rigorous peer
review, but only a quick, minimal editorial review before publication (Almeida 2013; Cold Spring
Harbor Laboratory 2014). In addition, many scientists are now publishing non–peer-reviewed
scientific findings in electronic form, via websites, digital white papers and technical reports,
blogs, webcasts, YouTube videos, etc. (ACS publications 2013; Almeida 2013). This approach
allows for rapid dissemination of scientific findings, but it bypasses both the peer review
process for publication and the peer selection process for speaker presentation at a scientific
meeting. That said, the current peer review process is not without flaw, in terms of bias and
error (Smith 2006), so perhaps the shift to digital self-publication will reform the peer review
process or kindle an entirely new form of peer review. New hybrid models such as eGEMs
combine peer review with free publication and open access permissions; whether these models
are fiscally sustainable has yet to be seen. Another challenge is the wealth of scientific
information available today. This dissemination deluge makes it challenging to adjudicate
existing findings and identify important new findings that may become lost in the wealth of
available data (i.e., the “long tail of science”) (Trader 2012).
[6] “Peer review” refers to the process whereby scientific journals or scientific funding organizations enlist (for free)
the help of colleagues within the same field as the manuscript or grant proposal’s investigative team to evaluate
the scientific merit and integrity of a document (Smith 2006).
[7] “Open access” scientific journals are online journals that permit access to all journal content without a
subscription fee (Cold Spring Harbor Laboratory 2014). The premise is to allow ordinary citizens to access all
scientific findings, particularly those that are generated with federal money, as a “public good” (e.g.,
http://publicaccess.nih.gov/), much the same way that federal salaries are released to the public for evaluation.
“Public goods”, as defined by economists, are those items (commodities or services) that are both non-excludable
and non-rivalrous; examples include public parks, sewer systems, highways, police services, etc. (Samuelson 1954).
Ironically, many journals require the author to pay a fee for open access publication.
Knowledgebases
Traditionally, the dissemination and adjudication of scientific knowledge involve social
processes; however, the storage, integration, search, and inference of knowledge are rapidly
changing to a hybrid model that relies equally upon humans and large, complex collections of
digital information that include data, relationships between data elements, human assertions
on the data, and, in some instances, capabilities for automated cognitive processing:
knowledgebases. The term “knowledgebase” was introduced in the 1970s (Jarke, et al. 1978) to
distinguish it from the existing database of the time, which essentially stored data in tabular
form for user access via query. The term remains loosely defined, but a knowledgebase can be
considered to be a computing system in which assertions are represented and persisted within
an overarching ontological framework that provides the semantic context and that allows for
querying and computing across assertions. The human user is integral to the knowledge derived
from and contained within the knowledgebase; and, in some cases, the system can be
structured to derive new, automated assertions based on machine learning or new data. While
traditional databases have evolved to become transaction-based and relational and can be part
of a knowledgebase system, knowledgebases remain differentiated in that the human user’s
ability to dynamically access, add to, and use the knowledge is an essential part of the design
and function of the system (e.g., Wikipedia) (Pugh & Prusak 2013).
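The working definition above (assertions stored within an ontological framework, with querying and computing across assertions) can be sketched in a few lines. Everything in this sketch is an illustrative assumption: the class, the two predicates standing in for an ontology, the example terms, and the single transitive inference rule. It does not describe any of the systems named in this paper.

```python
from collections import namedtuple

Assertion = namedtuple("Assertion", ["subject", "predicate", "object", "source"])


class KnowledgeBase:
    """Toy knowledgebase: assertions constrained to a fixed vocabulary of predicates."""

    def __init__(self, predicates):
        self.predicates = set(predicates)  # stands in for the ontological framework
        self.assertions = []

    def add(self, subject, predicate, obj, source):
        """Record an assertion, rejecting predicates outside the vocabulary."""
        if predicate not in self.predicates:
            raise ValueError(f"unknown predicate: {predicate}")
        self.assertions.append(Assertion(subject, predicate, obj, source))

    def query(self, subject=None, predicate=None, obj=None):
        """Return assertions matching every field that is not None."""
        return [
            a for a in self.assertions
            if (subject is None or a.subject == subject)
            and (predicate is None or a.predicate == predicate)
            and (obj is None or a.object == obj)
        ]

    def infer_transitive(self, predicate):
        """Derive new assertions by treating the given predicate as transitive."""
        while True:
            known = {(a.subject, a.object) for a in self.query(predicate=predicate)}
            derived = {
                (s, o) for s, m in known for m2, o in known if m == m2 and s != o
            } - known
            if not derived:
                return
            for s, o in sorted(derived):
                self.add(s, predicate, o, source="inferred")


# Human-entered assertions (illustrative content only).
kb = KnowledgeBase(predicates={"part_of", "annotated_to"})
kb.add("chromatin assembly", "part_of", "S phase", source="curator (hypothetical)")
kb.add("S phase", "part_of", "cell cycle", source="curator (hypothetical)")
kb.add("gene_x", "annotated_to", "chromatin assembly", source="curator (hypothetical)")
kb.infer_transitive("part_of")
```

Keeping the `source` field on every assertion separates human-curated statements from machine-derived ones, which matters for the curation and validity questions the text raises.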
Completely automated knowledgebases are under development, but automated curation of
human assertions is not yet possible and may never be. Nonetheless, emerging
knowledgebases are incorporating sophisticated algorithms for automated and
semi-automated structuring, parsing, summarization, retrieval, and visualization of information.
Leading-edge production knowledgebases currently contain 10^9 to 10^11 assertions mined from a
variety of unstructured and semi-structured data sources, including Wikipedia and PubMed
(Bordes, et al. 2015; Southern 2015). Prominent commercial examples of knowledgebases
include the IBM Watson system and Elsevier Pathway Studio; proprietary examples include
WalMart’s Social Genome and Google’s Knowledge Graph; open source systems include
OpenBEL, OpenPHACTS, and DARPA’s DeepDive. Other examples include the Semantic Web
and the Linked Data movement.
The use of knowledgebases in scientific research is growing. For instance, the GenBank
knowledgebase allows one to search for specific genes and then identify a wealth of curated
information about that gene and related genes and external data sources. The Gene Ontology
(GO) knowledgebase provides annotation on genes, biological processes, cellular components,
and molecular functions and has been prominent in the application of knowledgebases for new
statistical approaches such as enrichment analysis (Mi, et al. 2013). As scientists become more
aware of the benefits of knowledgebases, and as the quality and quantity of their assertions
increase, their application in scientific discovery is expected to grow.
Major challenges in the development of knowledge bases for scientific discovery include the
costs of human curation, in terms of the amount of time and experience required to contribute
to a knowledge base in a meaningful way, and the lack of incentives for scientists to contribute
to knowledge bases, especially those perceived to be outside of a scientist’s area of expertise.
As exemplified in the knowledge base systems listed above, recent research has advanced our
understanding of how to construct knowledge bases, including the incorporation of probabilistic
models, structured representation of unstructured data, deep-learning systems,
question-answer schemas, and natural language processing algorithms. Recent research also has
advanced our understanding of how to integrate knowledge from different sources with
differing quality and differing semantics (Southern 2015; Nickel, et al. 2015). For example, DeepDive
is built upon a probabilistic graph model that allows for the encoding of assertions and
relationships between data elements that have inaccuracies, such as those that are derived
from predictive models and data-driven approaches, along with assertions that have higher
validity. The system uses inference engines to mine the relationships, as well as the strength of
the assertions, in order to construct different “world models” upon which new inferences can
be made. These advances show promise, but the human user, the scientist, remains the
roadblock in the further development and widespread application of knowledge bases for
scientific discovery.
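To make the idea of assertions with differing strength concrete, the sketch below merges several uncertain assertions about the same relationship into a single belief using a noisy-OR rule. This is a deliberately simple stand-in and not DeepDive’s actual probabilistic graph inference; it assumes the sources are independent, and the gene and disease names, confidence values, and function name are all hypothetical.

```python
from collections import defaultdict
from math import prod


def combine_assertions(assertions):
    """Merge confidences for identical (subject, predicate, object) triples
    with a noisy-OR, treating each source as independent evidence."""
    by_triple = defaultdict(list)
    for triple, confidence in assertions:
        by_triple[triple].append(confidence)
    return {
        triple: 1.0 - prod(1.0 - c for c in confidences)
        for triple, confidences in by_triple.items()
    }


# Hypothetical input: two lower-confidence assertions about the same
# relationship, plus one higher-confidence assertion about another.
assertions = [
    (("GENE_A", "associated_with", "DISEASE_X"), 0.6),
    (("GENE_A", "associated_with", "DISEASE_X"), 0.5),
    (("GENE_B", "regulates", "GENE_A"), 0.9),
]
for triple, belief in combine_assertions(assertions).items():
    print(triple, round(belief, 3))
```

Under this rule, two weak but agreeing sources reinforce each other, while a single assertion keeps its original confidence. A full system would also have to model correlated sources and contradictory assertions, which the noisy-OR rule ignores.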
The Big Picture: The Future of Knowledge Discovery in Science
We recognize that there have always been scientific approaches that do not rely on the
scientific method, as others have argued previously (e.g., Cleland 2001). Yet, the scientific
method remains the “gold standard” in terms of scientific discovery. The wealth of data
available today offers the unprecedented opportunity to conduct science in ways never before
envisioned: real-time science, real-world data, and new methods of scientific discovery. In
embracing new approaches to scientific discovery that are not necessarily aligned with the
scientific method, or maybe even contradict it, we must consider how to best employ and unify
the new approaches with each other and with the traditional scientific method.
John Tukey, considered by many to be the most influential statistician in modern times, once
stated: “The best part of being a statistician is that you get to play in everyone’s backyard.”
Among Tukey’s many accomplishments are the introduction of the terms “bit” (in 1946) and
“software” (in 1958) and the concepts and methods of exploratory data analysis, data mining,
the Cooley-Tukey fast Fourier transform, and a variety of other statistical and mathematical
approaches for scientific discovery, many of them also named after him (Bittrich 2000;
Leonhardt 2000; Brillinger 2002). Tukey also introduced the term “uncomfortable science”,
which has been described as lying somewhere between classical mathematical statistics and
“magical thinking” (Diaconis 2011). The concept is that in many instances, scientific inference
can and must be made from intuition and exploration, rather than controlled experimentation,
using a finite amount of potentially rich data that are flawed and often nonreplicable.[8] Tukey
died in 2000, before the introduction of the terms “big data” and “Internet of Things”, but one
can bet that he’d be the first to take advantage of the many opportunities that the
“datafication” of society has provided (Bertolucci 2013).
Another forward-thinking scholar is the late James (Jim) Gray, whose last position was as a
Technical Fellow and Manager of the Bay Area Research Center at Microsoft. In addition to his
pioneering work on large databases and transaction processing systems, Gray introduced the
concept of the Fourth Paradigm for science (Hey, et al. 2009).[9] While Gray’s initial focus was on
exploratory analysis and data-intensive scientific computation, the Fourth Paradigm essentially
represents the radical transformation of the scientific method from hypothesis-driven to
hypothesis-generating science, as described herein.[10] Gray’s paradigm was developed in the
late 1990s and early 2000s, but his genius lies in his vision for today’s world, just a decade or
two later, where nearly every scientific domain is moving toward data-intensive and
data-enabled scientific discovery. This shift involves traditional fields such as physics and astronomy,
which have always had access to rich data sources but are now facing unprecedented
computational and analytical challenges, and emergent fields such as environmental science,
which has only recently had access to the wealth of distributed data available from
environmental sensors and satellites. Even the “soft” sciences such as social science and
political science are recognizing the power of data derived from social media, mobile
devices, and “smart cities”. Moreover, applied fields such as medicine are embracing the Fourth
Paradigm and these myriad new data sources. For example, in 2011, the National Research
Council organized an ad hoc committee to develop a framework for a unified taxonomy of
human disease that, when implemented, will accelerate progress toward precision medicine,
which is a new area of medicine that aims to personalize medicine through the integration of
data on individual variability in genes, environment, and lifestyle in order to identify the most
appropriate medical monitoring and treatment plan for any given patient (National Research
Council 2011). The committee’s work largely motivated President Obama’s January 2015
announcement during the State of the Union Address on a new National Initiative in Precision
Medicine (Collins & Varmus 2015).
[8] “Far better an approximate answer to the right question, which is often vague, than an exact answer to the
wrong question, which can always be made precise.” - John Tukey, 1962, The future of data analysis. Annals of
Mathematical Statistics, 33, 1–67 (quoted on pp. 13-14).
[9] After Gray’s death, Alex Szalay and colleagues formally introduced the concept of the “Fourth Paradigm”, with his
co-authored publication in the journal Science (Bell, et al. 2009).
[10] During Gray’s last lecture on January 11, 2007
[http://research.microsoft.com/enus/um/people/gray/jimgraytalks.htm], he is quoted as stating the following: “The world of science has changed,
and there is no question about this. The new model is for the data to be captured by instruments or generated by
simulations before being processed by software and for the resulting information or knowledge to be stored in
computers. Scientists only get to look at their data fairly late in this pipeline. The techniques and technologies for
such data-intensive science are so different that it is worth distinguishing data-intensive science from
computational science as a new, fourth paradigm for scientific exploration.” (Hey, et al., 2009).
A Framework for Scientific Discovery in the Era of Big Data: The Collaborative
Knowledge Network
The time is right to whole-heartedly embrace the flexibility and power that new approaches to
scientific discovery offer, while maintaining the power that the traditional scientific method
holds.
The scientific approaches presented in this white paper are increasingly being adopted by
scientists; however, there remains hesitancy about their roles and legitimacy in the scientific
process, a lack of training in and incentives for their use, and the need for additional research to
tailor and improve these approaches for scientific discovery. We argue that a primary challenge
in moving towards a combination of hypothesis- and data-driven science is the integration of
approaches at a community level.
The National Research Council’s framework to create a path toward personalized medicine
includes the concept of a Knowledge Network of Disease as a federated discovery network
organized around an Information Commons of structured patient-centric data based largely on
genomics and including patient medical history, current signs and symptoms, etc. (National
Research Council 2011). We suggest a similar, but broader framework for a Collaborative
Knowledge Network for Scientific Discovery (Figure 1). The envisioned network would build
upon the knowledge bases that are already being developed, but through an open,
community-based effort designed to link these knowledge bases and thereby create a knowledge network
for scientific discovery. The effort would involve not only scientific experts, but all stakeholders,
including policy makers, industry representatives, and even ordinary citizens, thereby enabling
nontraditional data sources and data types to drive scientific discovery. Indeed, scientists today
have access to an abundance of new data sources and data types, which we’ve classified as: (1)
archetypal or visible big data such as the large, well-curated, public data sets available from
NASA and other large-scale research initiatives; (2) crowd-sourced or supernova big data such
as the moment-to-moment data sets available from Twitter and the Internet of Things; and (3)
long-tail or dark big data such as those hard-to-find data sets held by small scientific teams and
amateur or citizen scientists. Efforts such as Wikipedia provide a model for open, collaborative,
community-governed knowledge accumulation (Cohen 2014). A community-driven and
community-governed knowledge network could revolutionize scientific discovery and move
science in directions not imaginable today.
Figure 1. Conceptual framework for the proposed Collaborative Knowledge Network for
Scientific Discovery.
Closing Considerations
The Collaborative Knowledge Network for Scientific Discovery, as proposed and envisioned
herein, will require new scientific methods, approaches, and policies in order to fully realize its
potential. For example, issues related to data provenance will need to be addressed, as will
issues related to funding for ongoing development and long-term sustainability. The integration
of knowledge assertions generated through different scientific approaches that have differing
levels of certainty and expressiveness remains a fundamental challenge, but one where
progress is being made. These are not trivial issues, as the creation of new knowledge from
collective knowledge, particularly when automated, yields complicated issues regarding
intellectual property and ownership. With multiple stakeholders involved in the envisioned
knowledge network (e.g., academics, industry, government, the public), ownership issues will
need to be resolved. Ongoing development and long-term sustainability bring related
challenges that likely will require significant up-front community buy-in. Complex models for
provenance and sustainability may need to be developed to account for conflicting interests.
Furthermore, today’s students and workers aren’t being trained for the Fourth Paradigm of
science and often lack the critical thinking skills required to objectively navigate through the
abundance of data available today and make meaningful inferences about the world around
them. This gap in training and workforce development will need to be addressed in order to
ensure that a knowledge network is not misused.
Perhaps the most important consideration, however, is the development of new approaches to
enable the critical evaluation and adjudication of scientific findings in the most traditional
sense. For, in the absence of valid automated methods for drawing conclusions regarding
scientific “truths,” scientists will continue to face a data deluge without a rational path toward
peer consensus, erroneous knowledge may be introduced and propagated, and the lay public
will lose trust in science. Indeed, the public already has a tendency to mistrust science,
especially when scientific findings go against intuition or personal belief (Achenbach 2015).
Without a facile method to determine the legitimacy of scientific findings, any public mistrust in
science will only increase. In the absence of public trust, science itself will suffer.
How to reference this paper:
Schmitt, C., Cox, S., Fecho, K., Idaszak, R., Lander, H., Rajasekar, A. and Thakur, S. (2015):
Scientific Discovery in the Era of Big Data: More than the Scientific Method. RENCI, University of
North Carolina at Chapel Hill. Text. http://dx.doi.org/10.7921/G0C82763
References*
Achenbach, J. (2015). Why do many reasonable people doubt science? National Geographic.
http://ngm.nationalgeographic.com/2015/03/science-doubters/achenbach-text.
Achilleos, K. G., Kannas, C. C., Nicolaou, C. A., Pattichis, C. S., & Promponas, V. J. (2012). Open
source workflow systems in life sciences informatics. Proceedings of the 2012 IEEE 12th
International Conference on Bioinformatics & Bioengineering (BIBE). Larnaca, Cyprus.
Alemida, P. (2013). The origins and purpose of scientific publications. JEPS Bulletin.
http://blog.efpsa.org/2013/04/30/the-origins-of-scientific-publishing/.
American Chemical Society Publications. (2013). Be found or perish: writing scientific
manuscripts for the digital age. BIOJOURNALS. American Chemical Society.
http://pubs.acs.org/bio/ACS-Guide-Writing-Manuscripts-for-the-Digital-Age.pdf.
Behrens, J. T. (1997). Principles and procedures of exploratory data analysis. Psychol Methods.,
2(2), 131–160.
Bell, G., Hey, T., & Szalay, A. (2009). Beyond the Data Deluge. Science, 323(5919), 1297-1298.
Berman, H. M., Gabanyi, M. J., Groom, C. R., Johnson, J. E., Murshudov, G. N., Nicholls, R. A.,
Reddy, V., Schwede, T., Zimmerman, M. D., Westbrook, J., Minor, W. (2015). Data to
knowledge: how to get meaning from your result. IUCrJ., 2(Part 1), 45–58.
Bertolucci, J. (2013). Big data’s new buzzword: datafication. InformationWeek.
http://www.informationweek.com/big-data/big-data-analytics/big-datas-new-buzzworddatafication/d/d-id/1108797?.
Betz, F. (2011). Origin of Scientific Method. In Managing Science: Methodology and
Organization of Research (Innovation, Technology, and Knowledge Management Series, No. 9),
by F. Betz. Springer Science+Business Media.
Bittrich, M. E. (2000). Biography of John Wilder Tukey. AT&T Bell Laboratories Communication.
http://cm.belllabs.com/cm/stat/tukey/bio.html.
Bordes, A., Weston, J., Collobert, R., Bengio, Y. (2009). Learning structured embeddings of
knowledge bases. Association for the Advancement of Artificial Intelligence, 301-306.
http://www.thespermwhale.com/jaseweston/papers/AAAI11.pdf.
Bowers, S. (2012). Scientific workflow, provenance, and data modeling challenges and
approaches. J. Data Semant., 1(1), 19–30.
Brillinger, D. R. (2002). John W. Tukey: his life and professional contributions. Annals Statist.,
30(6), 1535–1575.
Burkhardt, F. (1996). Charles Darwin’s Letters. A Selection. Cambridge University Press.
Buytaert, W., Baez, S., Bustamante, M., Dewulf, A. (2012). Web-based environmental
simulation: bridging the gap between scientific modeling and decision-making. Environ Science
Technol., 46, 1971–1976.
Cohen, N. (2014). Wikipedia vs. the Small Screen. The New York Times. February 9, 2014.
http://www.nytimes.com/2014/02/10/technology/wikipedia-vs-the-small-screen.html?_r=2m.
Cold Spring Harbor Laboratory. (2014). Guide to open access.
http://cshl.libguides.com/content.php?pid=222607&sid=1847688.
Collins, F. S., Varmus, H. (2015). A new initiative on precision medicine. NEJM, 372(9), 793–795.
http://www.nejm.org/doi/pdf/10.1056/NEJMp1500523.
Curcin, V., Ghanem, M. (2008). Scientific workflow systems – can one size fit all? Proceedings of
the 2008 IEEE, CIBEC. IEEE.
Diaconis, P. (2011). Theories of Data Analysis: from magical thinking through classical statistics.
In: Exploring Data Tables, Trends, and Shapes, by D. C. Hoaglin, F. Mosteller, and J. W. Tukey.
John Wiley & Sons, 2011.
Diemer, J., Alpers, G. W., Peperkorn, H. M., Shiban, Y., Mühlberger, A. (2015). Perception and
presence on emotional reactions: a review of research in virtual reality. Front Psychol., 6, Article
26.
Evans, J., Wilhelmsen, K., Berg, J., Schmitt, C., Krishnamurthy, A., Fecho, K., & Ahalt, S. (2015).
Clinical genomics: how much data is enough? RENCI, University of North Carolina at Chapel Hill.
Text. doi:10.7921/G0F769G9. http://renci.org/wp-content/uploads/2015/02/0215WhitePaper-CLinicalGenomics-highres.pdf.
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. (1996). From data mining to knowledge discovery in
databases. AI Magazine, Fall 1996, 37–54.
Gelman, A. (2004). Exploratory data analysis for complex models. J. Computational Graphic
Statist., 13(4), 755–779.
Gower, B. (1997). Scientific method: an historical and philosophical introduction. London,
England: Routledge.
Guo, P. (2013). Data science workflow: overview and challenges. Communications of the ACM
blog. http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-andchallenges/fulltext.
Hey, T., Tansley, S., Tolle, K. (Eds.) (2009). The Fourth Paradigm. Data-Intensive Scientific Discovery. Seattle,
Washington: Microsoft Corporation.
Holzinger, A., Dehmer, M., Jurisica, I. (2014). Knowledge discovery and interactive data mining
in bioinformatics – state-of-the-art, future challenges and research directions. BMC
Bioinformatics, 15(Suppl 6), 1.
Jarke, M., Neumann, B., Vassiliou, Y., Wahlster, W. (1978). KBMS requirements of
knowledge-based systems. In: Schmidt, J. W., & Thanos, C. (eds.) Foundations of Knowledge Base
Management. Contributions from Logic, Databases, and Artificial Intelligence. Berlin, Germany:
Springer, pp. 391-395.
http://www.dfki.de/wwdata/Publications/KBMS_Requirements_of_KnowledgeBased_Systems.pdf.
Joppa, L. N., McInerny, G., Harper, R., Salido, L., Takeda, K., O’Hara, K., Gavaghan, D., Emmott, S.
(2013). Troubling trends in scientific software use. Science, 340(6134), 814–815.
Kunkler, K. (2006). The role of medical simulation: an overview. The International Journal of
Medical Robotics and Computer Assisted Surgery, 2, 203-210.
http://onlinelibrary.wiley.com/doi/10.1002/rcs.101/pdf.
Leonhardt, D. (2000). John Tukey, 85, statistician; coined the word ‘software’. The New York
Times. http://www.nytimes.com/2000/07/28/us/john-tukey-85-statistician-coined-the-wordsoftware.html.
Manyika, J., Chu, M., Bisson, P., Woetzel, J., Dobbs, R., Bughin, J., Aharon, D. (2015). The
Internet of Things: mapping the value beyond the hype. McKinsey & Company.
http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights/Business%20Technology/Unlocking%20the%20potential%20of%20the%20Internet%20of%20Things/Unlocking_the_potential_of_the_Internet_of_Things_Executive_summary.ashx.
McKie, T. (2008). How Darwin won the evolution race. The Guardian. June 21, 2008.
http://www.theguardian.com/science/2008/jun/22/darwinbicentenary.evolution.
Mi, H., Muruganujan, A., Casagrande, J. T., Thomas, P. D. (2013). Large-scale gene function analysis
with the PANTHER classification system. Nat Protoc., 8(8), 1551-1566.
Montgomery, S. (2009). Charles Darwin & evolution, 1809~2009. Natural selection.
Cambridge, UK: Christ’s College, University of Cambridge.
http://darwin200.christs.cam.ac.uk/pages/index.php?page_id=d3.
National Research Council. (2011). Toward Precision Medicine. Building a Knowledge Network
for Biomedical Research and a New Taxonomy of Disease. Washington, D.C.: The National
Academies Press. Contract No. N01-0D-4-2139. http://www.nap.edu/catalog/13284/towardprecision-medicine-building-a-knowledge-network-for-biomedical-research.
Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E. (2015). A review of relational machine learning
for knowledge graphs. From multi-relational link prediction to automated knowledge graph
construction. Cornell University Library: arXiv:1503.00759. http://arxiv.org/abs/1503.00759.
Perra, N., Gonçalves, B., Pastor-Satorras, R., Vespignani, A. (2012). Activity driven modeling of
time varying networks. Scientific Reports, 2, 469, 1.
Perraud, J. M., Bai, Q., Hebir, D. (2010). On the appropriate granularity of activities in a scientific
workflow applied to an optimization problem. International Environmental Modelling and
Software Society (iEMSs) 2010 International Congress on Environmental Modelling and Software
Modelling for Environment’s Sake, Fifth Biennial Meeting. Ottawa, Canada.
Pugh, K., Prusak, L. (2013). Designing effective knowledge networks. MIT Sloan Management Review.
September 12, 2013. http://sloanreview.mit.edu/article/designing-effective-knowledgenetworks/.
Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J.,
Lander, E. S., Mitzenmacher, M., Sabeti, P. C. (2011). Detecting novel associations in large
data sets. Science, 334(6062), 1518-1524.
Samuelson, P. A. (1954). The pure theory of public expenditure. The Review of Economics and
Statistics, 36(4), 387–389.
http://www.ses.unam.mx/docencia/2007II/Lecturas/Mod3_Samuelson.pdf.
Singh, M. P., Vouk, M. A. (undated). Scientific workflows: scientific computing meets
transactional workflows.
http://www.csc.ncsu.edu/faculty/mpsingh/papers/databases/workflows/sciworkflows.html.
Smith, D. F. (2014). A brief history of gamification. EDTECH, Focus on Higher Education.
http://www.edtechmagazine.com/higher/article/2014/07/brief-history-gamificationinfographic.
Smith, R. (2006). Peer review: a flawed process at the heart of science and journals. J Royal
Society Med., 99(4), 178–182.
Southern, M. (2015). Google is looking into ways to rank sites based on accuracy of information. Search
Engine Journal. March 4, 2015. http://www.searchenginejournal.com/google-is-looking-intoways-to-rank-sites-based-on-accuracy-of-information/127394/#.
Spangler, B. (2003). Adjudication. Beyond Intractability.
http://www.beyondintractability.org/essay/adjudication.
Speed, T. (2011). Mathematics. A correlation for the 21st century. Science, 334(6062), 1502-1503.
Trader, T. (2012). Taming the long tail of science. HPCWire.
http://www.hpcwire.com/2012/10/15/taming_the_long_tail_of_science/.
Wilhelmsen, K., Schmitt, C. & Fecho, K. (2013). Factors influencing data archival of large-scale
genomic data sets: a mathematical formalism to comprehensively evaluate the costs-benefits
of archiving large data sets. RENCI Technical Report Series, TR-13-03. University of North
Carolina at Chapel Hill, Chapel Hill, NC, USA. doi:10.7921/G0MW2F25.
http://renci.org/technical-reports/factors-influencing-data-archival-of-large-scale-genomicdata-sets/.
Zyda, M. (2005). From visual simulation to virtual reality to games. Computer, 38(9), 25–32.
*All hyperlinks were last accessed on September 4, 2015.