Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
REDUCING BARRIERS AND INCREASING UPTAKE OF TEXT AND DATA MINING FOR RESEARCH ENVIRONMENTSUSINGACOLLABORATIVEKNOWLEDGEANDOPENINFORMATIONAPPROACH DeliverableD3.1 ResearchReportonTDMLandscapein Europe ©2016FutureTDM|Horizon2020|GARRI-3-2014–665940 2 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE Project Acronym: FutureTDM Title: Reducing Barriers and Increasing Uptake of Text and Data Mining for Research EnvironmentsusingaCollaborativeKnowledgeandOpenInformationApproach Coordinator: SYNYOGmbH Reference: 665940 Type: Collaborativeproject Programme: HORIZON2020 Theme: GARRI-3-2014-ScientificInformationintheDigitalAge:TextandDataMining(TDM) Start: 01September,2015 Duration: 24months Website: http://www.futuretdm.eu E-Mail: [email protected] Consortium: SYNYOGmbH,Research&DevelopmentDepartment,Austria,(SYNYO) StichtingLIBER,TheNetherlands,(LIBER) OpenKnowledge,UK,(OK/CM) RadboudUniversity,CentreforLanguageStudies,TheNetherlands,(RU) TheBritishLibraryBoard,UK,(BL) UniversiteitvanAmsterdam,Inst.forInformationLaw,TheNetherlands,(UVA) Athena Research and Innovation Centre in Information, Communication and Knowledge Technologies, Inst. for Language and Speech Processing, Greece, (ARC) UbiquityPressLimited,UK,(UP) FundacjaProjekt:Polska,Poland,(FPP) ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 3 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE Deliverable Number: D3.1 Title: ResearchreportonTDMLandscapeinEurope Leadbeneficiary: RU Workpackage: WP3:ASSESS:Studies,Publications,LegalRegulations,PoliciesandBarriers Disseminationlevel: Public(PU) Nature: Report(RE) Duedate: 31.01.2016 Submissiondate: 29.04.2016 Authors: MariaEskevich,RU AntalvandenBosch,RU Contributors: MarcoCaspers,UVA LucieGuibault,UVA AlessioBertone,SYNYO SusanReilly,LIBER CarmenMunteanu,SYNYO PeterLeitner,SYNYO SteliosPiperidis,ARC Review: MarcoCaspers,UVA LucieGuibault,UVA AlessioBertone,SYNYO Acknowledgement: This project has received Disclaimer: The content of this publication is the funding from the European Union’s Horizon 2020 sole responsibility of the authors, and does not in Research and Innovation Programme under Grant any way represent the view of the European AgreementNo665940. Commissionoritsservices. ThisreportbyFutureTDMConsortiummemberscan bereusedundertheCC-BY4.0license (https://creativecommons.org/licenses/by/4.0/). ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 4 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE TableofContents 1 Introduction....................................................................................................................................6 1.1.Whatistextanddatamining?.....................................................................................................6 1.2.ThegrowingimportanceofTDM.................................................................................................7 1.3.Goalandstructureofthereport..................................................................................................9 2 TDMframework............................................................................................................................11 2.1.Datatobemined........................................................................................................................11 2.2.TechnologyandR&D..................................................................................................................16 3 TDMacrosseconomicsectors.......................................................................................................21 3.1.Primarysector............................................................................................................................22 3.2.Secondarysector........................................................................................................................22 3.3.Tertiarysector............................................................................................................................23 3.4.Quaternarysector......................................................................................................................25 3.5.Quinarysector............................................................................................................................26 4 Summaryoftheliterature.............................................................................................................27 5 Conclusion.....................................................................................................................................40 References.............................................................................................................................................41 ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 5 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE LISTOFFIGURES Figure1.NumberofdifferenttypesofpublicationsintheDBLPComputerSciencebibliography........8 Figure2.NumberofdifferenttypesofpublicationsontheEuropeanLifeSciencesPMCwebsite........9 Figure3.MacroviewofTDMframework.............................................................................................11 Figure4.MicroviewonTDMenvironment..........................................................................................17 Figure5.GeneraleconomicstructureandconnectionbetweenTDMandalleconomicsectors........22 Figure6.Timelineofreports.................................................................................................................29 ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 6 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE 1.INTRODUCTION Recentyearshaveseenanexponentialgrowthinthevolumeofdigitaldataavailabletoalllevelsof society.Afurtherproliferationofdigitalordigitisedcontentisexpectedinthis,theeraofBigData.A numberofinterrelatedfactorshaveboostedthisdatathrustandtheaccompanyingdevelopmentof technologiestoexploitthisnewtypeofinformation-richgroundmaterialwithtextanddatamining (TDM). First, the costs for data storage and cloud computing facilities continue to decrease, while computing power increases. Second, digitisation has moved from specific processes, such as archiving and optimizing industrial workflows, to nearly all daily aspects of societal activities. This processpavesthewayfordataficationofthisinformation,i.e.thecreationofnewinformationand knowledge, potentially with new value, on the basis of machine-readable content. TDM is the umbrellatermfortechnologiesthatenablethisgoal. Thecontinuousgrowthofdata1increasinglyemphasisesthequestionofhowtobestmakeuseofit. Societyexpects,withgoodreason,thattheimplementationofintelligentmethodstodealwithBIg Datawillleadtoimprovedinformationaccess,fasterknowledgediscovery,higherproductivity,and competitiveness.TheseexpectationshaveinspiredTDMresearchersandcompaniesacrosstheworld to explore novel algorithms for extracting information, discovering knowledge, and improving the ability to predict from data on a large scale. As data is a manifold concept, so is TDM. This report aimstochartandstructuretheTDMlandscape. 1.1.WhatisTextandDataMining? Text and Data Mining was initially defined as “the discovery by computer of new, previously unknown information, by automatically extracting and relating information from different (…) resources,torevealotherwisehiddenmeanings”(Hearst,1999),inotherwords,“anexploratorydata analysisthatleadstothediscoveryofheretoforeunknowninformation,ortoanswersforquestions for which the answer is not currently known” (Hearst, 1999). Nowadays, this umbrella term encompassesdiversetechniquesthatallowinterpretationofcontentofanytyperangingfromraw data, e.g. sensor data, text, images and multimedia, to processed content, e.g. diagrams, charts, tables,references,maps,formulas,chemicalstructures,andmetadatafromsemi-structuredsources, onalargescalethroughtheidentificationofpatterns. TDM algorithms harbour vast potential for nearly all scientific fields and for a wide diversity of practical industrial and societal applications. However, this broad potential and variety of TDM applications complicates the landscape view, as even the technology itself is referred to using different terms within different fields. Within the business and managerial context, TDM can be referred to as Business Intelligence Solutions or Qualitative Data Analysis. To highlight the central 1 Itisexpectedthattherewillbe16trilliongigabytesofdataby2020,whichcorrespondstoanannualgrowth rate of 236 % in data generation (Motion for a resolution further to Question for Oral Answer B8-0116/2016 pursuant to Rule 128(5) of the Rules of Procedure on “Towards a thriving data-driven economy” (2015/2612(RSP),2016): http://www.europarl.europa.eu/sides/getDoc.do?type=MOTION&reference=B8-2016-0308&language=EN ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 7 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE positionofvastquantitiesofdata,thetermBigData(Processing)isused.Whenthefocusisonthe discovery of hitherto undiscovered information, Content Mining appears to be a term of choice, whileDataAnalyticsandTextAnalyticsemphasisethedata-drivenaspectofperforminganalysesand afocusonseekinganalyticalsolutionstochallenges.Intheacademicworld,TDMisoftenconsidered tobecomposedofMachineLearningmethodscoupledwithExploratoryDataAnalysismethodssuch as visualisation. Machine Learning, in turn, arose from the post-war fields of Artificial Intelligence, InformationTheory,andPatternRecognition. 1.2.TheGrowingImportanceofTDM Withtheincreaseofcomputationalpower,theriseofBigData,andthedigitisationofvastamounts ofdata,theuseofTDMpromisestoyieldvaluablesocietalandeconomicbenefitsintermsofnew insights and cost savings that would otherwise not be possible: more scientific breakthroughs, a greater understanding of society, and countless business opportunities. The potential of TDM as a researchtechniqueandasaneconomicopportunityhasbeenrecognisedatapoliticallevelintheEU (HargreavesI.etal.,2014).2ThebenefitsofTDMhavealsobeennotedbytheresearchcommunity andsocietalstakeholderse.g.viatheHagueDeclaration3. Originallyrootedinacademicresearch,TDMhasleftitsacademicchildhoodyearsandisnowaglobal informationtechnology(IT)industry.Asanindustry,TDMisfuelledbydirectaccesstolargeamounts ofdata.CompaniescaneitherdevelopandruntheirownTDMsoftware,oroutsourcethisworkto external experts who can customise TDM solutions. Industry is now rapidly rolling out the implementation of TDM for Big Data, as computing power grows and businesses find new ways to gather and access data. According to the International Data Corporation’s Worldwide Big Data TechnologyandServicesForecastfor2013-2017,thegrowthrateoftheBigDatamarketuntil2017 willbesixtimesfasterthanthatoftheoverallinformationandcommunication(ICT)market.TheBig DataValuePublic-PrivatePartnership2predictsthatTDMwillreachatotalofEUR50billionby2017, andmayresultin3.75millionnewjobs.WhilemanycompaniesseethepotentialofTDMfortheir business development, only 1.7 % of companies are currently making full use of advanced digital technologieslikeTDM,despitethebenefitsthatdigitaltoolscanbring.2 TheapplicationofTDMtoscienceholdsthepromiseofacceleratingscientificdiscoveries.Miningthe content present in research publication databases and the associated research data sets, as far as they are available for harvesting, will boost scientific research by increasing the productivity of researchers and leaving part of the scientific discovery process to computers, a trend called Data Science. The size of scientific publication databases varies across domains, but in each case, the number is beyond a value that would be feasible for a human to read and assess without digital support. Figures 1-2 give examples of the volume of content that has to be dealt with by the 2 (Motion for a resolution further to Question for Oral Answer B8-0116/2016 pursuant to Rule 128(5) of the Rules of Procedure on “Towards a thriving data-driven economy” (2015/2612(RSP), 2016): http://www.europarl.europa.eu/sides/getDoc.do?type=MOTION&reference=B8-2016-0308&language=EN; http://ec.europa.eu/research/innovation-union/pdf/TDM-report_from_the_expert_group-042014.pdf 3 TheHagueDeclarationonKnowledgeDiscovery: http://thehaguedeclaration.com/the-hague-declaration-on-knowledge-discovery-in-the-digital-age/ ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 8 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE academia and industry researchers in the Computer Science4 and Life Sciences domains5, respectively. TDM may help researchers in all scientific fields to regain control over the academic process,toaggregateinformationandtofollowtrendsacrossfieldsbeyondtheirown.Estimationsof addedeconomicvalueinthiscase,basedonthetimesavingbenefitofTDMalone,areanincreaseof 2percent,orEUR5.3billion,withlong-termimpactofoverallgainintherangeofatleastEUR32.5 billion(Filippov,2014). Figure1.NumberofdifferenttypesofpublicationsintheDBLPComputerSciencebibliography.6 4 http://dblp2.uni-trier.de https://europepmc.org 6 http://dblp2.uni-trier.de/statistics/publicationsperyear.html[accessedonMarch2016] 5 ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 9 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE Patents 4.2million NHSguidelines 765 Abstracts 30.9million, 25.9million fromPubMed Fulltextarticles 3.6million Agricolarecords 617244 Figure2.NumberofdifferenttypesofpublicationsontheEuropeanLifeSciencesPMCwebsite.7 1.3.Goalandstructureofthereport Inthisreport,weoutlinethetextanddatamininglandscape.Westartbyprovidinganoverviewof types of data and technologies that the TDM experts are working with. We look at the global economicstructureoftheknowledge-basedsocietyandhighlightusecasesforTDMuptakeacrossall economic sectors. Following recent discussions on TDM perspectives within the scientific, political, andeconomiccontext,wehighlighttheimportanceofthetechnologyanditspotential. TDM algorithms are developed globally. It is therefore our task to list and structure the important technology definitions, data types, underlying algorithms and leading research communities on a worldwide scale. However, the actual uptake of TDM and ease of practical implementation vary considerablydependingonthecountryandtheeconomicsectorinquestion.Inthisreport,wegivea broad perspective on the general European Union (EU) landscape, and quote exceptional countryleveldirectivesonlywhenthesecountrieshavelegalregulationsspecificallyaddressingTDM,e.g.the caseofstatutoryexceptiononcopyrightintheUK8. TDM is becoming increasingly ubiquitous on the global market, as each company that tracks its activities,products,andcommunicationsinsomedigitalformhasthepotentialtoreusethisdatafor furtheranalysisandimprovement.Forsomesectors,countriesandapplications,itisalreadypossible to estimate the potential profit or at least the size of the market. Other fields such as journalism, 7 8 https://europepmc.org/About[accessedonMarch2016] "Section 29A and Schedule 2(2)1D of the Copyright, Designs and Patents Act 1988": http://www.legislation.gov.uk/uksi/2014/1372/regulation/3/made ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 10 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE which is in the process of embracing the concept of Data Journalism, have to first agree on the conceptualchangesrequiredtotheircurrentworkbehaviourinordertomergethetraditionalwork patternwithTDM(LewisS.C.andWestlundO.,2014). This report is structured as follows: in Section 2, we introduce the definitions of the overall TDM framework, classifications of data types, and the technologies that can be used in text and data miningprocess.InSection3,weshowhowTDMispotentiallyrelatedtoallsectorsofeconomics,and we illustrate this assumption with examples and potential use cases of TDM implementation and integration across sectors. In Section 4, we draw a timeline of scientific, industrial and political documents that emphasise the importance of and interest in TDM and the need for uptake across sectors,andwesummarisethekeymessages.WeconcludethereportwithSection5,outliningthe importanceofTDMforthefutureoftheeconomyandscience. ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 11 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE 2.TDMFRAMEWORK Thequantityofdigitalcontentcurrentlyproducedbysocietyandthepotentialeaseofaccesstothis dataatanylocationviaeasy-to-usedevices(withreliableInternetconnection,andmobiledevices) increasetheneedforcontentminingtechnologiesthatboosthumancapacitiesintermsofspeedof dataprocessing(Mayer-SchoenbergerandCukier,2013).TDMtechnologiescansupportindividuals aswellaslargerinstitutionsandcompaniesindecision-makingprocessesandinthedevelopmentof novel ideas and solutions. In this section we discuss diverse data classifications, technologies that underlie TDM implementations, and the research communities that bring together the researchers anddevelopersfrombothacademiaandindustrytoimproveandadvanceTDMalgorithms. 2.1.Datatobemined TDMisnotdefinedthroughasinglescientificdomainortheory;itisacomplexsetoftechnologies with different underlying theories and assumptions, consisting of components that are developed withindiversedisciplines.AsFigure3shows,atamacrolevel,theTDMframeworkiscentredaround data.AsweentertheBigDataera(Mayer-SchoenbergerandCukier,2013),theefficientstorageand managementofcontentrepresentasignificantchallengeoftheirown,bothintermsoftheamount ofinformationandintermsoftheheterogeneousnatureofthedata,whichcanbenumeric,textual, multimodal,etc.Themorecomplexthedata,themorecreativeandinnovativetheTDMcomponents have to be. Data can be structured according to its size and legal accessibility, according to its sources,orbywhoproducesitthecontentthatconstitutesthedatacollections. InSection2.1.1,weprovideanoverviewofthetypesofsourcesthatcangenerateandholddata,to illustratethebreadthofthesourcesonwhichTDMmayoperate.InSection2.1.2,wesummarisea newlyproposedclassificationofdatasetsaccordingtodatasizeandtypesoflegalaccess,inorderto identifythetypesofdatasetsthatcouldbemoreeasilyaccessedandminedbyfor-profitandnonprofitorganisations,inthecasethatcopyrightrestrictionsareadjusted. Figure3.MacroviewofTDMframework.9 9 ThisFigure(‘MacroviewonTDMenvironment’)byFutureTDMcanbereusedundertheCC-BY4.0license. ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 12 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE 2.1.1.Datasourcesanddataholders Datacomesincountlessforms.Itiseasytothinkofdataasintentionallycollectedtextorscientific data,butdatamaybealsoproducedbyhumansandbusinessesastheycarryoutdatafiedactivities, leaving so-called “data exhaust” as a by-product of their actions and movements in the world, e.g. users’ online interactions, corporate and personal websites, and released digital documents and archives (Mayer-Schoenberger and Cukier, 2013). Another type of data is represented by measurements of uncontrolled events happening in nature as recorded by weather sensors, telescopes,andotherinstrumentsacrosstheglobe. Belowwelistdifferentdatasources,rangingfrompersonaldata,createdortrackedfromindividuals, to professionally gathered and created content in forms of cultural knowledge data, scientific and businessdata,publicsectorinformation.AllthisdatacanbeeithersharedpubliclyontheInternet, andstoredoffline.Whensharedpublicly,theoriginalcontentisenrichedwithadatadescriptionand furtherinformationthatareavailableonawebsite. Personaldata ThetracesthateachInternetuserleavesformamassiveamountofinformationthathasvaluewhen gathered and structured on a mass scale. Currently this information is mostly given by individuals eitherwithorwithouttheirconsent;inthefirstcaseoftenwithoutthemrealisinghowprofitablethis information is, and in the second case, without them even realizing that their information is being collectedinanyway. States also collect information about individuals in the country, and can often compel people to providethemwithinformation,ratherthanhavingtopersuadethemtodosooroffersomethingin return, as private companies would have to. This special status allows a state to gather large amountsofinformationthatcanbepotentiallyminedbyprivateorpublicinstitutions.Thiscanthen beusedtooptimiseinteractionsbetweenthestateandindividuals,toimprovetheservicesprovided bythestate,andtocreatebusinessmodelsaroundthisdataortousethisknowledgeforthegreater goodofthesociety.Atthesametime,alongwiththeaccesstopersonaldata,statesareresponsible forprotectingtheprivacyofthedataproducerswhenminingthisdataandreleasingittothegeneral public. Culturalheritagedata There has been enormous investment in the digitisation of heritage data, i.e. data that are consideredworthwhilearchiving,studying,orexhibiting.In2010,itwasestimatedthatitwouldtake EUR10billiontodigitiseEurope’sculturalheritageholdings.3By2012,itwasestimatedthatatleast 20% of Europe’s cultural heritage collections had been digitised10. At the same time libraries were investing in preserving digital collections and exploring ways to increase the accessibility of this content. Libraries use TDM to structure digital content, and to develop and explore further possibilities for knowledge discovery e.g. via topic modelling. Improvements in the accuracy of 10 http://pro.europeana.eu/enumerate/statistics/results ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 13 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE optical character recognition (OCR), including for handwritten texts11 will increase the impact and efficiency of applying TDM to digital content in order to enhance metadata or analyse collections. Google Books12 and Google Art13 operate on a worldwide scale, and digitise cultural (printed) heritage for their own technology development, as well as to make this data accessible to a wider audience.TheHathiTrustDigitalLibraryisapartnershipmodelbringingtogetherdigitisedcollections fromacrossUSinstitutionsandmakingthemavailabletoUSresearchersforanalysis.14AtaEuropean level, Europeana, which collects metadata about heterogeneous data such as digitised artworks, artefacts,books,videosandsounds,providesaccesstothismetadataandlimiteddatasetsviaAPIs15. The DARIAH16 and CLARIN17 infrastructures illustrate the value and potential of access to these extremely heterogeneous corpora for the research community. Both infrastructures support TDM acrossawide-rangingcorpora,e.g.CLARINsupportsnaturallanguageprocessingandanalysisoftext (includingnewspapers,webfora,books),andspeech(newsbroadcasts,signlanguage). Scientificdata Scientific data consist of two main types: data that have been collected through experiments and studies(measurementsofanykind,texts,videos,etc.),andthescientificpublicationsthatdescribe the work after analysis has been performed (containing data description, methods, experimental results,etc.). Data used in experiments may be released to a general audience, along with the publication describingthestudy.Thisjointreleaseofdata,withcorrespondingpublicationandrelatedmetadata, is considered to have a beneficial side-effect: it promotes scientific research reproducibility and scientific integrity. An increasing number of researchers and international research organisations perceivethistobeapositivestepforwards(ICSU,ISSC,TWAS,IAP,2015).18 Scientific data can be released purely as a data set collected for general purposes. In this way, researchers in business or academia can access the data and decide themselves on the type of experimentstorun.AnexampleofthisapproachistheopenaccessreleaseoftheEUsatellitedata fromCopernicusEarthobservation19.Theinitiativetoaskthequestionsandtoanswerthemliesin thehandsoftheaudience.OpendataanddatasharinginresearchisontheincreaseinEuropedue to funder-mandates such as the H2020 Open Data Pilot, and the increasing availability of infrastructurefordatasharinge.g.genericservicessuchasEUDAT20andZenodo21.Onaworldwide 11 http://transcriptorium.eu 12 13 https://books.google.com https://www.google.com/culturalinstitute/project/art-project https://sharc.hathitrust.org 15 http://labs.europeana.eu/api 16 https://de.dariah.eu/tatom 17 http://www.clarin.eu 18 FORCE11 MANIFESTO: Improving Future Research https://www.force11.org/about/manifesto 19 http://www.eea.europa.eu/highlights/eu-satellite-data-to-be 20 http://www.eudat.eu 14 ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 Communication and e-Scholarship: 14 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE scale,anexampleistheResearchDataAlliance(RDA)22,aforumwheretheinfrastructuresfordata sharingacrossdisciplines,institutions,andcountries,arediscussed,andguidelinesareformulated.In addition,industryisalsobeginningtomaketheirscientificdatasetsavailable,e.g.clinicaltrialdata, oftenforethicalortransparencyreasons. Scientific publications can also be considered as a valuable dataset for TDM. These texts represent potentialsourcesofnewknowledgewhentheinformationtheycontainissummarisedandanalysed incomparisonwithotherstudiesandlinesofenquiryfromotherdisciplines.Oftencontainingfigures anddatavisualisations,articlesareofvaluenotjustbecauseofthetexttheycontain,butbecause they are important sources of raw data and context. TDM analysis of publications can help find connections between studies in different domains that are implicitly or indirectly connected (Swanson, 1986). In the field of bioinformatics, the literature has been mined to analyse the interactions of genes and proteins in various diseases. Often this type of analysis is performed on Medline23abstracts,astheyarefreelyavailablebutithasbeennotedthatthisisextremelylimiting. VanHaagenetal.foundthatonly32%ofknownProteinProteinInteractions(PPIs)couldbefoundin Medline abstracts, and therefore the rest could only be found by accessing the full text literature (vanHaagenetal.,2009). Scientific publications that are released via open access are, by definition, accessible for TDM and TDM-basedapplications.PubMedCentral24hasbecomeawidelyusedsourceforTDM.TheEuropean open access publication infrastructure OpenAire25 not only supports TDM by encouraging storing articles under open access licences and in interoperable machine readable formats, it also utilises TDM to infer links between data and publications, and to provide funders with data regarding publications arising from research funding and research trends26. Subscription content is less accessibleas,withtheexceptionoftheUK,Europeanresearcherscanonlyminethiscontentunder conditions specified by licences. In contrast to this, in the US, researchers may mine content that theyhavelegalaccessto,withouttheneedforadditionallicencepermissions. Businessdata Businessesarecreatingandcollectingdatafromallpossiblesources,asminingdatafromcombined sources improves insights into internal processes, strategies, and the market. These data can vary from energy consumption by business agents, such as shops and carriers, to personal information about customers and their digital and financial transactions. Most of this data constitutes value withinthecontextofaknowledge-basedeconomy,andissensitivetodataprotectionissues,thus,is usedinternallybycompaniesandisnotreleasedpublically. 21 https://zenodo.org https://rd-alliance.org 23 https://www.nlm.nih.gov/bsd/pmresources.html 22 24 http://www.ncbi.nlm.nih.gov/pmc https://www.openaire.eu 26 https://blogs.openaire.eu/?p=88 25 ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 15 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE Publicsectorinformation(PSI) PSIisdataproducedandmadeavailablebygovernmentalinstitutions.Today,agrowingnumberof governmentsrealisethatthisdatacanbringmoreprofitandbenefittosocietywhenreleasedtoa general audience. PSI data release initiatives allow commercial companies that are already experienced in data mining, to use this additional information source to build better products for theircustomers,consequentlycreatingmorejobsandboostingcommercialsuccess. The European Union Open Data Portal27 gives access to a growing range of PSI data from the institutions and other bodies of the European Union (EU); it is free for both commercial or noncommercial purposes. The EU Open Data Portal is managed by the Publications Office of the EuropeanUnion.ImplementationoftheEU'sopendatapolicyistheresponsibilityoftheDirectorateGeneralforCommunicationsNetworks,ContentandTechnologyoftheEuropeanCommission. Open data portals have also been set up by individual EU Member States.28 The datasets of public documentsthattheseportalsshare,aremostlyunderCC-0license,i.e.theownersdidnotreserve anyrights,andthedataisfreeforsharinganduseinanycountry,byanyinstitutions.Thesedatasets representthedocumentsinthenativelanguagesofeachspecificcountry.29,30,31,32,33 Internetdata Sincetheintroductionofthefirstwebsitein1991,theWorldWideWebhadexpandedtomorethan 45 billion pages with unique URLs in the late 2010s (Bosch et al., 2016). Many webpages contain valuable information that can be crawled for further mining. More generally, the Internet (the infrastructureonwhichtheWorldWideWebisbased,butalsoservicessuchasinternet,email,and filetransfer)allowsaccesstolargeamountsofdataofalltypes,thatcanbecrawledwiththeproper credentials. Many data remains, of course, hidden behind passwords and firewalls. The so-called Deep Web, the part of the World Wide Web that is not directly accessible for indexing by search engines,hasbeenestimatedtoholdatleast500timesasmuchdataastheindexed(‘surface’)web.34 2.1.2.Datatypes AnytypeofdataconsideredtobeproperlyavailableforTDMmustfirstbeavailableas,orconverted to,astandardisedandmachine-readableformsothatitcanbeeasilyprocessed.Intheir2014report to the European Commission entitled 'Standardisation in the area of innovation and technological development, notably in the field of Text and Data Mining', Hargreaves et al.(2014) identified five 27 28 29 https://open-data.europa.eu/en/data http://ec.europa.eu/newsroom/dae/document.cfm?action=display&doc_id=8340 https://data.gov.uk 30 31 https://data.overheid.nl https://www.data.gouv.fr 32 https://www.govdata.de 33 http://data.norge.no 34 https://en.wikipedia.org/wiki/Deep_web ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE typesofdatabasesthatlendthemselvestoTDM,accordingtotheirsizeandaccesspolicies,aswellas theirpotentialforrevenuestructurechanges:35 1.XXL:alldatabehindfirewalls,i.e.companiesandorganisations’internaldatabases,notaccessible tothepublic.36 2. XL: all publicly accessible data not behind a firewall or paywall. e.g. freely accessible newspaper websites. 3. L: all publicly accessible data located behind a paywall. e.g. online newspaper pages that are accessiblewithasubscriptionfee. 4 M: all publicly accessible data behind a paywall whose clients are mainly researchers and companiesinneedofreports.e.g.Reuters,Bloomberg. 5.S:scientificpublishers’databehindapaywall. Thisclassificationcoversdatacollectionsofdifferentsizesthatcontaincontentofadiversenature. Wecanmakeadistinctionbetweenhigh-leveldatathatislikelytobesubjecttocopyrightprotection, and low-level data that is highly unlikely to be protected under copyright law (although data protectionlawmayapplyincertaincases): - Protected:text,image,sound,multimedia; Non-protected,unlessthecontentbecomespartofthedatabasethatthecompanyor institutiondidputsubstantialinvestmentinto:e.g.clicks,likes,GPScoordinates,timestamps, andmeasurementsofadifferentkind. 2.2.TechnologyandR&D Textanddatamininghasgraduallyevolvedintoalargedomainwithcomplexapproachesforarange ofapplications.Historically,thedevelopmentofminingtechnologieshasitsrootsinthemid-1700s, withtheinventionofmathematicalmodels,regressionmethods,andstatisticalanalysis.Duetothe advent of computers and systems for storing larger quantities of data, the possibility to mine very largedatasetsandfilterrelevantinformationhasgreatlyexpandedwhatusedtobelaboriousexpert analysis work. Since the beginning of the 1990s, modern data mining tools have appeared on the market and in labs, allowing more advanced analysis, to discover hidden patterns and to extract implicit knowledge, as envisaged and tested by Swanson D.R. (Swanson, 1986). With the invention andadoptionoftheWorldWideWeb,researchersexpandedtheireffortstodeveloptextanddata miningmethodstosearchandexploretheweb.Thisledtothecreationofsearchengines,starting from simple interface-based browsers like Mosaic and Netscape, and developing into large-scale systems like Google and Bing. It represented the opening gambit towards more complex methods, not only based on keyword search, but also on their correlation, their similarity, their joint probabilities,andtotheapplicationoftextminingtodifferentfields(suchasthebiomedicaldomain, 35 Theoriginaldocumentusestheterm‘database’;wereplacedthisby‘data’. 36 ThesecollectionsareoutsidetheFutureTDMscope. ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 16 17 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE economics, social sciences, or humanities), and types of data and content (such as scientific publicationsandresearchdata,socialmedia,multimedia,orpublicsectorinformation). Figure 4 shows a zoom-in view of the TDM environment, outlining the main research fields that develop,test,andfinallyprovidethealgorithmsforTDMimplementationsforeachspecificdomain of use. The core scientific areas are machine learning, which is machine learning of tasks and the discoveryofpatternsandstructureindata;informationretrievalandsearch,whichfocusesonthe optimisation of search methods that index vast amounts of content and allow the location and extraction of the most relevant facts for any question and user type; and natural language processing techniques, which deal with human language in its textual and spoken form, covering humanity’s past as well as reflecting current trends and topics. This scheme does not include an exhaustive list of techniques and research directions; its aim is to illustrate the scale and potential scientific interaction, and the broad knowledge required to create a successful, scientifically sound TDMapplication. Below,webrieflyintroducethemainresearchdirectionsthatconstitutethecomponentsoftextand dataminingapplications. Information Extraction Artificial Intelligence Concept Extraction Information Retrieval and Search Classification Clustering Opinion Mining Negation and Modality Detection Predictive Analytics Machine Learning Natural Language Processing Machine Translation Regression Multimedia Processing Association Rules Textual Entailment Summarisation Statistics Figure4.MicroviewonTDMenvironment.37 Machine Learning (ML) studies the automated recognition of patterns in data, and develops algorithmsabletolearntaskswithoutexplicit(human)instructionorprogramming.Learningisbased 37 ThisFigure(‘MicroviewonTDMenvironment’)byFutureTDMcanbereusedundertheCC-BY4.0license. ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 18 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE ondataexplorationandisaimedatgeneralizingknowledgeandpatternsbeyondtheexamplesthat arefedtothealgorithmattrainingstage. Informationretrieval(IR)canvaryfromsimpleretrievalofalltheitems(e.g.fulldocuments)withina collection that contain user-requested words in metadata fields or structured databases, to more sophisticatedsearchandrankingoftheresultsbasedonthepresenceofcombinationsofimportant wordsandnumbersindifferentsectionsoftextsordatafields(Baeza-YatesandRibeiro-Neto,1999). It is important to note that, although TDM uses information retrieval algorithms as one of the componentswithintheTDMframework,itgoesbeyondthesimpleretrievalofwholedocumentsor their parts. Within the IR framework, the task of knowledge processing and of new knowledge creation still resides with the user, while in the case of full TDM technology implementation, the extractionofnovel,never-beforeencounteredinformationisthemaingoal(Hearst,1999). Naturallanguageprocessing(NLP)aimsatdevelopingcomputationalmodelsfortheunderstanding andgenerationofnaturallanguage.NLPmodelscapturethestructureoflanguagethroughpatterns that map linguistic form to information and knowledge or vice versa; these patterns are either inspiredbylinguistictheoryorautomaticallydiscoveredfromdata.TheNLPcommunityissubdivided intogroupsthatfocusonspokenorwrittencontent,i.e.speechtechnologies(ST)andcomputational linguistics (CL). ST researchers work with audio signals, synthesising and recognising speech and paralinguisticphenomena.Oncespeechcontenthasbeentranscribedinatextualrepresentation,it canbefurthertreatedastextforTDM.CLresearcherstargetsyntacticanddiscoursestructures;their algorithms help us understand the basic concepts, relations, facts, and events expressed in textual content. ThevisualisationoftheoutputoftheML,IR,andNLPalgorithmsisanintrinsicpartofalltheTDM constituenttechnologies.Visualisationisvaluableforresearchersinfieldslikebioinformaticswhere, for example, the mining of protein behaviour described in a set of publications is visualised in one scheme;orintextmining,whenawordcloud,withwordsindifferentsizeandcoloursdependingon their frequencies, gives a general impression of the most frequent terms used in a collection of documents. Statistics, as a field of mathematics in general and machine learning in particular, provides its practitionerswithawidevarietyoftoolsfordatacollection,analysis,interpretationorexplanation, andpresentation. Classification/Categorisation/Clustering of content within a collection in order to split (in a supervisedorunsupervisedmanner)theseintogroupsofcertaintypesbasedoncontent,representa TDM preprocessing step at the collection level, as these categories are used for further specific miningofthecollection’scontent.Machinelearning(ML)algorithmsrepresentadominantsolution forthistask,e.g.theNaïveBayesapproach(Lewis,1998),decisionrules(Cohen,1995),andsupport vectormachines(Joachims,1998).Whileclassificationalgorithmsidentifywhetheranobjectbelongs toacertaingroup,regressionalgorithmsestimateorpredictaresponseasavaluefromacontinuous set. ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 19 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE Association rule-learning algorithms allow for the discovery of interesting relations between variables within large databases (Agrawal et al., 1993). For example, when a large database of customer shopping behaviour is mined, association rules help to define what types of products or services are usually sold together. This information can help to adjust and tailor marketing campaigns,aswellasmakeproductarrangementsinthestoresmoreefficient. Informationextraction(IE)isawideconceptthatcoverstheinferenceofinformationfromtextual data.Examplesoftypesofextractedinformationarestatisticsontheusageofterms,theoccurrence of named entities in text and their relations or associations according to certain classification schemes,andthepresenceofspecificfactsorlogicalinferencesexpressedintexts(e.g.interactions betweengenes,logicalproofs,ortheinferenceoflogicalconsequencesandentailmentexpressedin thetext). Term or Concept extraction is a first step for further IE, as it is necessary to define the terms of importance for the corpus to start its processing. Once these terms, being specific for the topic to conceptualiseagivenknowledgedomain,areextracted,anontologyofthefieldcanbecreatedfor furtherIEtaskimplementations. Negationandmodality(orhedge)detectionisanimportantaspectofTDM,asnegationandhedging constructionsinthetext(suchas“OurresultscastdoubtonfindingXofAuthorZ”or“Wefoundthat findingZdoesnotapplytocontextY”)offercrucialevaluationsofreportedfacts. Sentimentanalysis,oropinionmining,aimstodeterminetheattitudeofaspeakerorawriterwith respecttosometopicortheoverallcontextualpolarityofadocument.Theattitudemaybehisor her judgment or evaluation, the affective state of the author, or the intended emotion the author intendstoconveytothereader. Summarisationofagiventextorasetoftextsisthetaskofcreatingashortertextthatcapturesthe most crucial information from the original texts. First, the crucial content to be summarised is defined,andthen,dependingonthetask,itiscondensedintoanewrepresentationthatmayconsist of sentences or parts from the original documents (extractive summarisation), or that may be a newly created text or multimedia document that rephrases the summarised message (abstractive summarisation). Predictive analytics covers the methods that analyse previous usage or transactions statistics and, basedontheseresultsandtrends,offerspredictionsonfurthersystemsdevelopment. Machine translation targets human quality level of translation of written or spoken content from onelanguagetoanotherusingcomputers.Althoughitoriginallystartedasrule-basedsystemstaking linguistic knowledge into account, nowadays the pioneers in the field use Big Data size collections anddeeplearningalgorithmstotraintheirsystemsandtoachieveperformanceimprovements. Multimediaprocessingallowsfurtherminingofallaspectsofmultimediacontent,varyingfromthe audio stream of a given video to visual concepts extraction and annotation, person and object discovery and localisation on the screen. This direction of research and applications has gained a ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 20 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE great deal of importance, due to the fact that contemporary Internet users share and create large quantities of multimedia content that, for example, reflect their opinions and spread information aboutevents. As depicted in Figure 4, the methods listed here are not insulated domains of research and application;theyrepresentacomplexinteractivesetoftoolsandmethodsthatcanberecombinedin multipleways,dependingontheneedsofthespecifictaskandtheavailabilityofdataandexpertise. ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 21 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE 3.TDMACROSSECONOMICSECTORS Since the beginning of humanity, the development of novel technologies has defined societal structure and its employment distribution. Agriculture and natural resource extraction form a primary sector, as this is an initial type of human joint-effort to produce food for survival. The inventionofmachinesandelaboratetoolsledtothedevelopmentofthesecondarysector,whichhas evolved into complex machinery engineering, construction, and space exploration. The third sector comprises all the services that society members deliver to each other, varying from retail to transportation, and from medical healthcare support to entertainment. Until recently, this three componentstructurerepresentedmostofthesocietiesintheworld.However,thegrowthofdatadriven business, services and research, has led to a discussion on the changes in the economic structurethatshouldreflectanoveltypeofknowledge-basedeconomy,whereknowledgeanddata become a separate valuable commodity (OCDE, 1996). Thus recently, the research that supports knowledge sharing and growth, as well as education of the professionals that can carry out these activities,havebeentermedaquaternaryeconomicsector.Moreover,thehighleveldecisionmakers in governments, large industry companies, and education, who have the potential to shape the futureoftheentiresectorswiththeirvisionandfollowingdecisions,havebeenplacedinaseparate, quinarysector. Forthepurposeofourreport,weareinterestedinthecurrentandpotentialpresenceofTDMinall economicsectors.WedepicttheserelationsinFigure5.Theprimary,secondary,andtertiarysectors followroughlythesametypeofinteractionwithTDMtechnologies.Thesesectorsrepresentsources of all possible types of data that are available for mining, and at the same time, have tasks and challenges that require smart solutions. The quaternary sector is particularly central to TDM development and implementation, as it produces the TDM experts and advances the technology itself.Expertsatthedecision-makerleveluseTDMtoolstotestandprovetheirvisionofeconomic development, as well as to hypothesise where a specific decision will lead to, and to measure and assessitsprogress. In the sections below, we list examples per sector and subfield on how TDM is used, in order to illustratethebroadnessofTDMuptakeanditspotentialfordevelopment. ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 22 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE SECONDARY SECTOR Aerospace Agro-Industry Bio-Technology Construction Technology Engineering Manufacturing Microelectronics Data & Tasks TERTIARY SECTOR PRIMARY SECTOR Data & Tasks Agriculture Coal Mining Technology Natural Resources Industries Wholesale Trade Technology TEXT AND DATA MINING Data & Tasks Across Economic Sectors Industry or Gov. Experts QUINARY SECTOR High Level Decision Makers in Government, Industry, Commerce, and Education Entertainment Finance, Banking, Insurance Health and Social Care IT Services Public Administration Retail/Trade Telecommunications and Journalism Tourism Transportation Technology TDM Experts Tools for Analysis QUARTENARY SECTOR Education Research and Development (e.g. Tech) Figure5.GeneraleconomicstructureandconnectionbetweenTDMandalleconomicsectors.38 3.1.Primarysector Making the best use of limited natural resources with a constantly growing population is one of humanity’s ultimate challenges. Taking on the challenge requires the understanding of changing patternsinclimateandenvironment;TDMhasthepotentialtodiscoversuchpatternsautomatically from data. The primary data that can be used for TDM are collected in the form of satellite measurements,weathersensorsonthegroundandbelow,intheair,andinthewater.Otherdata arederivedfromeconomicindicatorsfromtheprimarysectors:productionnumbers,profits,stock marketvaluepatternsofminedgoodssuchasoil,coal,andgold,andcropssuchaswheat39. 3.2.Secondarysector Anylarge-scalemanufacturingplantordistributednetworkofpipelinesispackedwithvarioustypes of sensors that produce a continuous stream of measurements that are crucial for maintenance analysis,suchastemperature,pressurelevels,leakdetections,etc.Whenanalysedonalargescale, this information offers a basis for discovering and understanding strategies for cost savings and safety measures. For example, calculating optimal routes for product delivery can boost fuel efficiency,leadingtoasmallerecologicalfootprintaswellastocostssavings;knowledgeofelectrical gridusageorgaspipelinesintegrityhelpstoschedulemaintenanceofthemostsensitivepartsofthe 38 This Figure (‘General economic structure and connection between TDM and all economic sectors’) by FutureTDMcanbereusedundertheCC-BY4.0license. 39 http://www.agroknow.com/agroknow ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 23 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE system (Mayer-Schoenberger and Cukier, 2013), (Auclair D., 2016); data-driven smart grid applicationscansignificantlylowerCO2emissions(ICSU,ISSC,TWAS,IAP,2015). 3.3.Tertiarysector ThesectorofprovidedservicesembracesTDMatadifferentpacedependingontheTDMexpertise andawarenessinthefield,BIgDataavailability,andthesensibilityofthisdataintermsofprivacy,as muchmorepersonaldataisminedwithinthissector. Companies allocate great effort to interacting with their existing customers, as this allows them to monitorcustomerfeedbackontheproducts,andatthesametimetoprofilecustomersforfurther targetedproductpromotion.Ontheotherhand,informationaboutusers’Internetsurfingbehaviour isgatheredintheformofclicklogs,websiteviewsandsearchqueries.Oncethisdataiscategorised and profiled, the advertising network industry can match the right advertisement from the right company to the suitable client profile. Sentiment analysis of customer comments in global social networks,suchasTwitterandinternetforums,allowscompaniestotrackcustomerinterestintheir products,aswellastousethesechannelsfortargetedadvertisingofservicesandproducts. 3.3.1.Finance,Banking,Insurance Inthefinancedomain,predictiveanalyticsplayacrucialrole.TDMbasedanalyseshelptomanage risks and to make informed decisions. Banks and insurance companies use TDM to build consumer profiles and develop automatic classifiers to determine eligibility for financial products, and to set insurancepremiums.Customerparameterscanincludediversetypesofinformationgatheredfrom different sources, ranging from consumers’ zip codes, employment background and educational history,totheirshoppinghistory,socialmediausageandfriendshipnetwork(RamirezE.etal.,2016). Financial institutions typically purchase this information from consumer reporting agencies (CRA) thatcrawldatafromtheInternetandfromspecialgovernmentandprivatedatabases. 3.3.2.Healthandsocialcare Both health care providers and consumers benefit when treatment plans and medical advice are tailoredtoindividualpatients.TDMhasgraduallybecomepartofthehealthcaresystem:inrouting and treatment tailoring, treatment monitoring, as well as in claims processing and insurance payments(AuclairD.,2016).Inaddition,BigDataisusedbydifferentmedicalinstitutionstopredict the likelihood of hospital readmission for different categories of patients and types of disease combinations. MedicaltreatmentsandpredictionssuggestedbyartificialintelligenceandTDMarebeingintegrated in decision support systems for doctors (e.g. Babylon40 in the UK, IBM Watson41 in the USA). Such systems,poweredbyAI,canhelptolowerthecostsinregionswithashortageofspecialtyproviders. 40 41 http://www.wired.co.uk/news/archive/2014-04/28/babylon-ali-parsa http://www-03.ibm.com/press/us/en/presskit/27793.wss ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 24 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE Moreover, medical TDM systems can flag prescription anomalies based on patient records, thus helpingtopreventmedicationerrors(AuclairD.,2016). 3.3.3.Publicadministration Publicadministrationcollectdataaboutdifferentaspectsofthelivesofthemembersofsociety,such asownershipofmovableandimmovableassets,useofthehealthcaresystem,employmentstatus, and family situation. This information is used to keep track of societal developments and changes, and to come up with adjustments to policies and reallocation of services. A substantial amount of thisinformationstaysofflineandisnotinterconnected.Ifthiscontentwouldbemadeopentothe publicthroughopengovernmentinitiatives,manytypesofbusinessesandnon-profitorganisations would have a chance to put this data to use, e.g. by connecting the PSI data to their own data, or building new products and services on top of the released PSI (Mayer-Schoenberger and Cukier, 2013). Likethefinanceindustry,publicadministrationcanbenefitfrominternalTDMimplementations,as intelligent mining and automated auditing of the data can help to detect and to prevent fraud, currently leading to financial losses in tax collection; TDM may generally help to optimise the workflow. 3.3.4.Retail Inretail,trackingandminingofpurchasetransactionsbycustomersallowscompaniestoadjustthe range of products on display according to the preferences of the customers in a particular area.In additiontobetterfittedproductselection,foodretailerscanadjusttheirfacilitiesarrangements,for example,refrigeratortemperatures,basedonlarge-scalestatisticsofalltheshopsinachain,which inturnhelpstosaveontheenergybills. Somemarketplacesaresetupassalesplatformsforindividualsandcompaniesincompletelyvirtual environments,likeinternationalgiantseBay42andAmazon43ortheFrenchsmallersizeequivalent,Le BonCoin44.Inthecontextofonlineshopping,theminingofuserbehaviour(clicks,productsselected orbought,timespentonthepage,browsersused,timeofthedaywhenthewebsiteisaccessed)is increasingly being used for product recommendations and reminders, which eventually increases sales. 3.3.5.Telecommunicationsandjournalism Thetelecommunicationsindustrygeneratessomeofthelargestmultimediadatasets,andTDMhelps to filter and to play already available content that is of interest to clients depending on their preferences. Mining customer preferences, in turn, influences decisions on allocating resources to thecreationofnewcontentthatwillpotentiallybesuccessful. 42 http://www.ebay.com 43 https://www.amazon.com 44 http://www.leboncoin.fr ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 25 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE Journalists always needed diverse and reliable information sources to report news events and to bring an analytical perspective to a general audience. Nowadays, journalists have to deal with an avalancheofdatasources,thusminingisseenasavaluabletoolforinvestigativejournalismtofind, confirm,andeventuallytoprovestoriesinamoretransparentway(DiakopoulosN.,2014).Miningof blogs,websitesandtwitterfeedsmakesjournalistsawareoftheiraudience’sinterestsandlevelof knowledge, which helps them to shape their social interaction accordingly and to select relevant news pieces (Flaounas I. et al., 2013). At the same time, the journalists have to be careful when redefiningtheiragendausingTDM.TheyneedtocumulatetheknowledgegainedfromTDMwiththe actualstateofsociety,asnotallsocietymembersarerepresentedinthesocialmediaorwebdata availableforcrawling.Intheirworkanduseofdata,journalistshavetofindabalancebetweendata personalisationandtheecologyofcommonknowledge(LewisS.C.andWestlundO.,2014). 3.4.Quaternarysector This information and knowledge-rich sector focuses on knowledge gathering, processing, and creation, via research practices and teaching society members. In this sector, TDM algorithms are boththesubjectofresearchandasetoftoolsthatenableresearch. 3.4.1.Research The speed of scientific progress depends to a large extent on the framework that supports the sharing of novel ideas and key findings within and across communities, the reproducibility and comparabilityofresults,andtheresearchimpactmeasurements.TDMtechniqueshavethepotential toaddvalueandincreaseefficiencyforallthesecomponents. AsmentionedinSection1andillustratedwithexamplesinFigures1-2,researchersnowadayshave tokeeptrackofanenormousnumberofpublicationstoretainanup-to-dateunderstandingofthe research in their field. Moreover, a growing amount of research happens at the intersection of severaldomains,ormethodsarebeingadoptedacrossdomains.Thus,thecurrentvolumesofcrossdomain content render them infeasible to be read by humans. TDM can help researchers discover newconnectionsandtrends,allowingthemtomoreoptimallyusetheirtimeandotherresources. Thequalityofscientificresultssharedwithinthescientificcommunityreliesandheavilydependson thesystematicreviewingofsubmittedarticlesforpublication.Thisrigorousprocessguaranteesthat theresearchoutcome,whenaccepted,bringsnovelinformationtothefieldofknowledge,andthe results have been achieved following the standard procedures that safeguard reproducibility and reliability.Ideally,highqualityreviewingreliesonfullawarenessofallthepublicationsinthefield. However,withtheincreasingnumberofpublishedstudiesatagloballevel,itbecomesunfeasiblefor anyhumanresearcher/reviewertokeeptrackofallstudiesthathavebeenreported.Therefore,TDM canprovidevaluablesupportforreviewers,asitcanreducetheworkloadandminimiseanypotential bias. ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 26 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE Furthermore,thequalityofpublishedresearchcanbemeasuredwithusingTDM.Forexample,smart mining of publication references that takes into account the discourse structure in case of each quotation,yieldsamoresubtleandinformativeoverviewofthewholecitationsnetwork. Clinicalstudiesrequireameaningfulsamplesizeofpatientswithcertainconditionsinordertotest theeffectivenessofnewdrugsormedicalprocedure.Pharmaceuticalcompaniesthatdevelopthese drugsandtreatmentsrelyonintermediarycompaniesthatcarryouttheactualscreeningofmedical recordsforsuitablepotentialcandidates.TheseintermediariesworkonalargescaleandrelyonTDM technologiestoestablishpatient-clinicalmatches,aswellastosupportpost-trialpharmacovigilance throughearlydetectionofadversedrugevents. 3.4.2.Education Traditional educational institutions can use BIg Data techniques to identify students for advanced classes, while at the same time determine students who are at risk of dropping out or might be in needofearlyinterventionstrategies.TheGatesFoundationhasinvestedinanumberofprogrammes to improve the quality and availability of educational data with the aim of developing policy and practicetoimprovetheimpactofeducation,andtobooststudentachievement45. Massive open online courses (MOOC), introduced in the late 2000s, are currently reshaping the wholeconceptofdistanceeducation,andhaveapotentialtoimpactthewholeeducationalsystem, as they make it possible to monitor and assess individual class performance based on large-scale miningofstudentprofiles,attendance,taskcompletion,andforumdiscussions. 3.5.Quinarysector Funders and high-level decision makers rely on their general visions of technology and societal development paths, as well as the impact-metrics that assess potential economic, political, and societal value. These experts can use TDM to mine the current status of affairs, and to predict the outcomes of diverse types of interventions. Moreover, sentiment analysis of social networks activitiescanprovidethemwithfastfeedbackfromthepopulationonactionstaken. 45 http://www.gatesfoundation.org/Media-Center/Press-Releases/2009/01/Foundation-Invests-in-Researchand-Data-Systems-to-Improve-Student-Achievement ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 27 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE 4.SUMMARYOFTHELITERATURE Astextanddataminingharbouralargepotentialacrosspracticallyallsectorsoftheeconomy,their legal,economic,andpoliticalimplicationshavebeendiscussedinseveralexpertreportsandfromthe pointofviewofvariousstakeholders.Thisreportisbasedonseveralofthesereports;inthissection we summarise the most prominent reports published in recent years for further reading and reference.Wehighlightthepointswhichcallforurgentlegalandeconomicchangesthatwouldallow TDMtounleashitsfullpotential.Wealsohighlightcaseswhereuptakesuccessismentioned,andwe outlinethedirectioninwhichdiscussionsdevelop.Figure6depictsthegeneraltimelineofreports, their research questions and focus (Legal aspects, Economic factors, Examples, Technology). It is worthnotingthatreportscreatedonbehalfofgovernmentsandacademiaaremoreoftenreleased via open access, even when produced by private companies. Therefore, they are available for analysis, while private sector reports, usually created as a product for sale, stay behind a paywall, restrictingouraccesstotheircontent. ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 28 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 29 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE 46 Figure6.Timelineofreports. Year Title Authors Researchquestions/ Focus 2011 DigitalOpportunity.A ReviewofIntellectual 47 PropertyandGrowth. I.Hargreaves Thisreportaimsto Basedonanalysisofprevious findeconomic publications,reports,reportedlegal evidencetosupport andeconomicpractices. changesinthe IntellectualProperty lawsatthelevelofthe UK,andfurtherata unifiedEUlevel. TwoimportantareasofutilityofTDM:(1)costsavings,(2) widereconomicimpactgeneration 2011 BIgData:Thenext frontierforinnovation, competition,and 48 productivity. McKinseyGlobal Institute(MGI) ToexploreBigData potentialwhenusedin everyeconomysector, itsinnovation potential,andthe addedvalue. Basedonanalysisofprevious publications,reports,reportedlegal andeconomicpractices,aswellas internalMGIresearch. ThisreportgivesageneralperspectiveontheBigDataTDM beingincorporatedintoalleconomicsectors,asdatafication reachesallsectorsoftheglobaleconomy.Theexamplesof potentialprofits,aswellasraisingissues,arelistedforbothEU andUSAmarkets. -Issues:datapoliciesinthecontextofBigData,privacy protection;storageandaccessfacilitiesandnoveltechniques thatneedtobeimplemented. -BigDataanditsprocessingcreateadditionalvaluefor businessesinanumberofways:improvedandmoreuser targetedservicequality,overalltransparencyofperformance, andthesupportofhumandecisionswithreliableanalytics. TheValueandBenefits D.McDonald,U. Toexplorecosts, Consultationswithstakeholdersand -Theavailabilityofmaterialfortextminingislimited.Evenin 2012 J.Manyika,M. Chui,B.Brown, J.Bughin,R. Dobbs,C. Roxburgh,A. HungByers Dataandmethods Keyfindings 46 ThisFigure(‘Timelineofreports’)byFutureTDMcanbereusedundertheCC-BY4.0license. https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/32563/ipreview-finalreport.pdf 48 http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation 47 ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 30 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE ofTextMiningtoUK FurtherandHigher Education.Digital 49 Infrastructure.JISC. Kelly benefits,barriersand risksassociatedwith textminingwithinUK furtherandhigher education casestudies(literaturereviewin systemsbiology,textminingto expediteresearch,increaseof accessibilityandrelevanceof scholarlycontent. caseofearlyadoptersinthefieldsofbiomedicalchemistry research,theminingislimitedtotheOpenAccessdocuments. -Mainbenefitsareincreasedresearchers’efficiency,improved researchprocessandquality.Thereportdemonstrateshow muchmoneyintermsofpersontimecanbesavedorusedfor properresearchtaskswhenthescientificcontentismined automatically,leavingtheresearcherswithanalysischallenges andpureresearchquestions. -Barriers:legaluncertainty,inaccessibleinformation. 2012 Businessintelligenceand MISQuarterly Analytics:FromBigData 50 toBigImpact. Themainfocusofthis -Bibliometricstudyofcritical revueistodescribe BusinessintelligenceandAnalytics anddiscussBusiness publications intelligenceand Analytics(BI&A) implementedfor diversemarket applicationsthatuse manydatatypes(from user-generated contenttoFederal ReserveWireNetwork information) -BI&Ainitscurrentstate(2.0)processesweb-based unstructureddata,andthenextstepforthetechnology(3.0) willbetousemobileandsensor-basedcontent. 2012 ResearchTrends.Special 51 IssueonBigData. Thisspecialissuehasa -- broadperspective varyingfrom applicationsscenarios -Bibliometricsanalysisconfirmsthat‘ComputerScience’and ‘Engineering’aretheleadingtopicsinBigDatapublications, andUSleadsthepublicationsraceinthefield. 49 https://www.jisc.ac.uk/reports/value-and-benefits-of-text-mining http://hmchen.shidler.hawaii.edu/Chen_big_data_MISQ_2012.pdf 51 http://www.researchtrends.com/wp-content/uploads/2012/09/Research_Trends_Issue30.pdf 50 ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 31 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE toinfrastructureissues forTDMinBigData context 2012 META-NET.Strategic ResearchAgendafor MultilingualEurope 2020.52 2013 2014 META Technology Council Thisreportfocuseson thecurrentstateof andthefurther potentialfor developmentof languagetechnologies onaglobalscale,and morespecifically,the Europeanlandscape. e-IRGWhitePaper. e-IRG Thisreportfocuseson theinfrastructure initiativesthathave alreadybeentaken andespeciallyon thosethatareurgently neededinorderto supportTDMona largescaleacross communities. -InfrastructureneedssupportatnationalandEUlevel,and requirescollaborationwithbusiness. -Openscienceshouldbesupportedthroughandviashared infrastructures. Bridgingthegap.How digitalinnovationcan drivegrowthandcreate P.Hofheinz,M. Mandel -Thisdocument outlinesthegrowthof dataataglobalscale, -Largerrevolutioninalleconomicsectorsandaspectsof societallifearepredictedtobeduetotheBigDataera,but mostimportantlythesewillbeshapedbydataanalyticsforthis 53 -MultilingualismisaanimportantfeatureofEurope,asit meansthatthereisneedforthediversificationandlocalisation ofthetoolsthatwillbeabletoprovideservicesatthesame levelforthewholevarietyofEuropeanlanguagesand customers. Basedonanalysisofprevious publications,reports,reportedlegal andeconomicpractices. 52 53 http://www.meta-net.eu/vision/reports/meta-net-sra-version_1.0.pdf http://e-irg.eu/documents/10920/11274/annex_5.2_e-irg_white_paper_2013_-_final_version.pdf ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 32 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE 2014 jobs.LisbonCouncil 54 SpecialBriefing. andaspectsofdatadriveneconomylike betterconditionsfor creatinganinclusive economyforall societymembers, betterpotentialfor cheaperindividual education,and trainingforadvanced manufacturing. -Theauthorsfocuson the‘datagap’ betweenEuropeand NorthAmerica. -Strongfocusisgiven tothediscussionon policyregulationsthat needtobeadjusted acrosstheAtlantic, bothintheEUandthe USA. MappingTextandData S.Filippov MininginAcademicand ResearchCommunitiesin Europe.LisbonCouncil 55 SpecialBriefing. -Thisreportcontinues theabove-mentioned workongathering informationonthe importanceofTDMfor theglobaleconomy data. -EuropeisbehindcountrieslikeSouthKorea,US,Canada,in termsofdatause. -Therearesuccessfulexamplesofopenaccesspolicy introductionforpublicdata.InEstonia,thestatereleased publiclyowneddataincludingutilitiesandcellphonenetworks afterhavingstrippedprivateinformationandanonymization, whichhadpositiveimpactonthedevelopmentofdata-driven businesses. -TheauthorsarguethatiftheEuropeandataprotection statutesarenotlightened,thecostofdatausagewillonlyrise, whichconsequentlywillwidenthegapbetweenEuropeand theUSA,aswellastherestoftheworld. Basedonanalysisofprevious publications,reports,reportedlegal andeconomicpractices,aswellas interviews,andbibliographyand patentbasedanalysison ScienceDirectdatabase. -Thedifferenceincopyrightaccessacrossthecountriesis reflectedbythenumberofpublicationsontextanddata miningsubjects.TheUSisleadingthefield,whiletheonly EuropeancountryatthetopofthelististheUK,whichhasan exceptionforcontentminingforacademicpurposes. -ThenumberofpublicationsonTDMinEnglishisseveral 54 55 http://www.progressivepolicy.org/wp-content/uploads/2014/04/LISBON_COUNCIL_PPI_Bridging_the_Data_Gap.pdf http://www.lisboncouncil.net/publication/publication/109-mapping-text-and-data-mining-in-academic-and-research-communities-in-europe.html ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 33 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE development,and focusesonthereasons whyEUlagsbehind theUSandsomeAsian countriesinthis domain,and hypothesiseshowthis canchangeeither positivelyor negatively,depending onthemeasures taken. 2014 2014 Standardisationinthe areaofinnovationand technological development,notablyin thefieldofTextand DataMining.Report fromtheExpertGroup, 56 EuropeanCommission. ExpertGroup Assessingtheeconomic impactsofadapting certainlimitationsand exceptionstocopyright andrelatedrightsinthe EU.Analysisofspecific 57 policyoptions. CharlesRiver Associates(CRA) ordershigherthaninanyotherlanguages,covering97%ofthe publications. -TDMispatentedasalgorithmsintheUSandother jurisdictions,whileinEuropeTDMispatentedasbeing embeddedinthesoftware.Chinaisrisingstronglyinthis marketintermsofpatents. -Thisreportliststhestakeholdertypesandthelegaland economicissuesthattheyencounterbothintheEUandinthe restoftheworld. -Theexpertshypothesiseonhowthevaluechainmightbe adjustedwhenTDMisfullyincorporatedandlegallyallowed, arguingthatitshouldhaveamorepositiveimpactoverall.The dataiscategorisedaccordingtobothsizeandlegalaccess pattern. -IntroductionofIPexceptionsforthecaseofTDMforscience shouldnotsignificantlyadverselyaffectrightholders’incentives forcontentcreation,contentqualityandTDM-specific investments. I.Hargreaves,L. Guibault,C. Handke,P. Valcke,B. Martens,R. Lynch,S.Filippov Thisreportfocuseson economicimpactsof specificpolicyoptions J.Boulanger,A. inseveraltopicsof Carbonnel,R.De interest(digital preservationofand Coninck,G. remoteaccessto Langus. 56 57 http://ec.europa.eu/research/innovation-union/pdf/TDM-report_from_the_expert_group-042014.pdf http://ec.europa.eu/internal_market/copyright/docs/studies/140623-limitations-economic-impacts-study_en.pdf ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 34 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE culturalheritage,elendingforlibraries, TDMforscientific purposes),inviewof providingpolicy guidanceonthese topics. 2014 Studyonthelegal frameworkoftextand 58 datamining(TDM). DeWolf& Partnersforthe European Commission Thefocusisonthe legalaspectsofTDM. Dataanalysisisthemostencompassingtermandismore preferableforuseinalegalcontext. Classificationof4differentlevelsofaccess:“alltoall”(webdate)/“manytomany”(socialnetworks)/“onetomany” (contractualclauses)/“onetoone”(confidentialagreements) 2014 TheDataHarvest:How sharingresearchdata canyieldknowledge, 59 jobsandgrowth. RDAEurope Thereportfocuseson themeasuresthat shouldbetakenin ordertosupport researchusingBig Data. -Mainincentivesofthereportinclude: --dataplanmanagement; --promotionofliteracyintermsofalltheaspectsofBigData useacrosssociety; --developtoolsandpoliciesthatsupportdatasharing. -Thisreportaimsto providepolicymakers andanalystswiththe meanstocompare 2015 F.Genova,H. Hanahoe,L. Laaskonen,C. Morais-Pires,P. Wittenburg,J. Wood OECDScience, TechnologyandIndustry Scoreboard2015. INNOVATIONFOR Organisationfor EconomicCooperationand Development, -Thisisabroadlyscopedreportthathassectionsthatdiscuss trendsandfeaturesofknowledgeeconomies. -ItgivesperspectivesontheoverallscientificexcellenceofEU countriescomparedtotherestoftheworldacrossallfieldsof 58 59 http://ec.europa.eu/internal_market/copyright/docs/studies/1403_study2_en.pdf https://rd-alliance.org/sites/default/files/attachment/TheDataHarvestFinal.pdf ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 35 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE 2015 GROWTHAND 60 SOCIETY. Europe Skillsofthedatavores. TalentandtheData 61 revolution. Nesta,UK J.MateosGarcia,H. Bakhshi,G. Windsor economieswithothers ofasimilarsizeorwith asimilarstructure, andtomonitor progresstowards desirednationalor supranationalpolicy goals. -Themainfocusof thisreportisonthe knowledge-based economies’ developmenttrends. Introducesitsown classificationofthe companiesthatwork withBigData,and thenanalysesthe typesofskillsdata scientistsand engineershaveto learnandmasterto boosttheeconomy andscientific development. Theanalysisfocuses ontheUKmarket whilethecompany science,andtheamountofinvestmentindifferentdomains, withspecialhighlightsonthenewgenerationofICT-related, BigDatabasedtechnologiesandpatents. -Acrossmanymetrics,Europeancountrieswhencombinedare comparabletotheresultsofUSA,Japan,China,SouthKorea. InternalNESTAanalysisofthedata analyticsmarketandhiring strategiesandchallengesinthefield thatarebasedonmorethan50indepthinterviews(inthefieldsof manufacturing,retail,ICT,creative media,financialservices, pharmaceuticals),collaboration withCreativeSkillset. 60 61 http://www.oecd.org/science/oecd-science-technology-and-industry-scoreboard-20725345.htm https://www.nesta.org.uk/sites/default/files/skills_of_the_datavores.pdf ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 -Classificationofcompaniesconnectedtodataprocessing: --‘Data-active’companiesthatareeitherdata-driven (Datavores),workwithlargevolumesofdata(DataBuilders),or combinedatafromdifferentsources(DataMixers), --Dataphobes:othercompaniesthatrestrictthemselvesto smallsizedatasets. -Thesedifferenttypesofcompanieshavedifferentbehaviours whilesettinguptheirrequirements,andthenhiringthedata experts.Forexample,datavoresanddatabuildersrecruitmost oftheircandidatesfromComputerSciences,Engineering,and Mathematics,whileBusinessdisciplinesarethesourcefor managementanalyticspositions. 36 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE classificationcanbe usedglobally. 2015 BriefingPaper.Textand DataMiningandthe NeedforaSciencefriendlyEUCopyright 62 Reform. ScienceEurope Summarisesthe Basedonanalysisofprevious argumentsinthe publications,reports,reportedlegal debateforcopyright andeconomicpractices. lawreformthatshould ensurethatlegallyaccessedcontent couldbefreelymined withoutadditional permissionandcost withintheEU. -EUregulationsfordatabaseprotection(Directive1996/9/EC) seriouslyimpederesearcherswhoaimtoaccessthecontentto beminedviaadatabase,asdownloadingasubstantialpartof thedatabasecontentneedstobeauthorisedbythedatabase owner,evenifnoneofthecontentisundercopyright. -Scientificpublishersstarttointroducelicencesregulatingthe miningofsubscribedcontent,whichfurtherlimitsthe researchersintheirimmediateinteractionwiththecontent. -ExceptionsincopyrightlawincountrieslikeUK,Japan,USA, whichallowtheirrepresentativestoavoiddifficultieswiththe legalityofcontentmining,whichputsEUresearchersata disadvantage,andcomplicatespotentialforinternational collaborations. -Theauthorssuggestthefollowingstepsforresearch organisations: --AvoidrequestingthattheirresearchersabstainfromTDM; --RefusetosignTDM-licensesaspartofsubscriptioncontracts, ornegotiatemorefavourablecondition; --Empowertheiremployeesinternallythroughtrainingand TDMsupport,andexternallybyadvocatingcopyrightchanges atnationalandEuropeanlevel. 2015 IsEuropefallingbehind inDataMining? Copyright’sImpacton C.Handke,L. Guibault,J.-J. Vallbe Thispaperfocuseson thecopyright arrangements DemonstratesthatresearchersfromtheEU/EEAcountrieshave weakerperformanceintermsofpublicationsinthegrowing fieldofdatamining.Thisispartiallyduetothecomparatively Theauthorsmeasureresearch outputinthenumberof publicationsinacademicjournal 62 http://www.scienceeurope.org/uploads/PublicDocumentsAndSpeeches/WGs_docs/SE_Briefing_Paper_textand_Data_web.pdf ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 37 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE DataMininginAcademic 63 research. 2015 ScienceInternational: OpenDatainaBigData 64 World. impactingdatamining byacademic researchers. International Councilfor Science(ISCU), International ScienceCouncil (ISSC),The WorldAcademy ofSciences (TWAS), InterAcademy Partnership(IAP) articlesinThomsonReuters’sWeb ofScience(WoS)database. -Thisreportfollowsan Basedonanalysisofprevious accordofInternational publications,reports,reportedlegal Scientificorganisations andeconomicpractices. andfocusesonthe opportunitiesthatthe BigDatareality represents. -Itdiscusseshowtext anddataminingofthis scientificcontentona massscalecanhelp whenthescience addressesglobal challengessuchas infectiousdisease, energydepletion, migration, sustainabilityandthe operationoftheglobal economyingeneral. 63 64 http://elpub.architexturez.net/system/files/15_HANDKE_Elpub2015_Paper_23.pdf http://www.icsu.org/science-international/accord/open-data-in-a-big-data-world-short ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 strongcopyrightprotectionlawsinthesecountries. -Unprecedentedopportunitiesforscientificdevelopmentin currentdigitalagerelyon2pillars:BigData(largevolumesof complexanddiversedatastreaminginrealtime)andLinked Data(separatedatasetslogicallyrelatedtoaparticular phenomenonallowcomputerstoinferdeeperrelationships). -Forthescientists,theopenaccesstothescientificdataand publicationsisamuch-neededimperative.Therefore,authors insistonderegulatingopendataforscientificdatausage.Data shouldbe“intelligentlyopen”,i.e.discoverableontheWeb, accessibleandintelligibleforfurtheruse,andassessablein termsofdataproducers’quality.Thisopennesswillalsohelp replicabilityteststhattestandscrutinisethereportedresults. -Theauthorsnotetheongoingdiscussionwhetherpublicdata shouldbefreelyavailabletoeveryone,orjusttothenon-profit sector.Theyarguethatitisprimarilynotappropriateor productivetointroducethisdiscriminationindataaccess,and moreimportantly,robustevidenceisaccumulatingtosupport theclaimthatthisaccesstodataforeveryonebringsdiverse benefits,andbroadereconomicandsocietalvalue. -Theauthorsgiveexamplesofcross-countrycollaborationsin openaccessinitiativesinSouthAmericaandAfrica,while notingthatsharingdataallowsmoreuniformdevelopmentof technologiesandtheirimplementationonaglobalscale. 38 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE 2016 TextandDataMining: TechnologiesUnder 65 Construction. Outsellmarket performance report,USA D.Auclair 2016 BigData.AToolfor InclusionorExclusion? Understandingthe 66 Issues. FederalTrade Commission (FTC)Report, USA E.Ramirez,J. Brill,M.K. Ohlhaussen,T. McSweeny -Focusofthisreportis onTDMaspectsfor scientificresearch, healthcare,and engineering. Theauthorsbasethereporton approximately20in-depth interviewswitharangeof stakeholderslikecontentvendors andtechnologyandtoolproviders, aswellasonpreviouslypublished information. -Theauthorshighlightthefactthatdatamininghasbeenused longerformarketresearchandcommercialactivities,as correspondingcommercialwebsitescontainlargequantitiesof structureddata,whileacademiaandindustryhavetodealwith highvolumeofunstructuredtextsanddata,whichmadetheir progressinuptakeslower. -ThereportprovidesalistofprovidersofTDM-related functionsandexamplesoftypesofbusinessthattheyare serving. -Theauthorslistanumberofactionsforcontentprovidersand TDMtoolvendorstofollow,thatcouldhelpthemtobenefit fromTDMuptakeinthenearfuture: --TDMsolutionsshouldbescalable --Storeddatashouldfollowthestandardsatcollectionand storagestages. --Privacyandsecurityregulationsaretobeensured. -Thisreportargues thatitisnota questionofwhether companiesshoulduse BigDataanalytics,but howtoitthat correctly,minimising legalandethicalrisks. -Highlightsneedto identifyandavoid potentialbiasesand inaccuraciesinTDM FTCheldapublicone-dayworkshop ‘BigData:AToolforInclusionor Exclusion?’Inordertogeta feedbackfrominvolved stakeholdersonthepotentialofBig Dataintermsofcreated opportunitiesforconsumersand policyissues,namelysecurity breaches,andbiasesand inaccuraciesinproper population/eventsrepresentationin thedata. -Thereportincludesalistofquestionsthataretobe consideredandinvestigatedwhencompaniesareusingBig Dataanalyticsintheirbusinessmodels.Thisprocedurecould helpthemtoavoidviolatinglawsintermsofdiscrimination, andtopreventerrorsthatmightoccurduetotrainingdataset inconsistenciesthatcouldleadtopotentialbiasesinoffersand services. Listofquestions: -Howrepresentativeisyourdataset? -Doesyourdatamodelaccountforbiases? -HowaccurateareyourpredictionsbasedonBigData? -DoesyourrelianceonBigDataraiseethicalorfairness 65 66 http://www.copyright.com/wp-content/uploads/2016/01/Outsell-Market-Performance-22jan2016-Data-and-Text-Mining-Licensed.pdf https://www.ftc.gov/system/files/documents/reports/big-data-tool-inclusion-or-exclusion-understanding-issues/160106big-data-rpt.pdf ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 39 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE concerns? applications,sothat theBigDataisusedto createnovel opportunitiesfor societal improvements,and especiallytosupport low-incomeand underserved communities. 2016 EuropeanParliament resolutionon‘Towardsa thrivingdata-driven 67 economy’. EUParliament J.Buzek onbehalfofthe Committeeon Industry, Researchand Energy Thisdocumentaimsto structurethe knowledge-based economylandscape, withafocusonthe EU,andtooutline whatareurgent measuresneedtobe takeninordertobe theleadersinthefield ataglobalscale. Basedonanalysisofprevious publications,reports,andinternally producedfinancialestimationsof themarketpotential. 67 http://www.europarl.europa.eu/sides/getDoc.do?type=MOTION&reference=B8-2016-0308&language=EN ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 -ThisresolutionoutlinesBigDataanddata-driveneconomy potentialtill2020,withafocusontheEuropeanissuesand needs. -AnnouncementoftheEuropean‘FreeFlowofData’initiative -StressestheneedforamodernICTinfrastructureata Europeanscale. 40 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE 5.CONCLUSION Theuptakeoftextanddataminingisgrowingacrossallsectorsofknowledge-basedeconomies,as thestrengthsofTDManditseconomicadvantages,suchastimesavingandscaling,areincreasingly wellrecognised(ICSU,ISSC,TWAS,IAP,2015),(AuclairD.,2016).WhilethedevelopmentofTDMby the larger IT companies is directed at global markets, there is a long tail of TDM research and development in industry aimed at specific markets and niches. Most TDM industries makes use of provenTDMtechnologies,suchassupervisedmachine-learningalgorithmsforclassificationorrisks assessment(Ittooetal.,2016),(FleurenandAlkema,2015),howeverthereisahealthyrelationship between industry and academic research on the development and uptake of new methods from fieldslikeArtificialIntelligence(e.g.deeplearningalgorithms). Ingeneral,industrieshavenotroublefindingaccesstodata,andmanagetouseTDMforeconomic profit. However, different types of barriers still occur. If barriersto the availability of full texts and datasetsfromacademicdomainsandpublicsectorinformationwerelifted,thefirstapplicationsthat couldbemadepossiblewouldfinallyreviveoldvisionsfromthesecondhalfofthe20thcenturyon accomplishing automatic unconstrained scientific discovery from observational data (Simon et al., 1981),(Simonetal.,2007). Large-scaleglobalcompanieslikeAlphabet(Google),IBM,andMicrosoft,currentlyinvestheavilyin TDM algorithms and infrastructure development, as much of the value they accrue nowadays is leveragedfromthedataprovidedtothembytheircustomers,incombinationwithallthecrawlable content on the web. For companies and research institutes based in the European Union, a competitivestrategynecessarilyinvolvesawideruptakeofTDM,basedonabetterawarenessofthe potentialofTDMacrossalleconomicsectors. ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 41 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE REFERENCES Agrawal, R., Imielinski, T., Swami, A., 1993. Mining association rules between sets of items in large databases, in: Proceedings of the 1993 ACM SIGMOD International Conference on ManagementofData.ACM,Washington,D.C.,USA,pp.207–216. Auclair D., 2016. Text and Data Mining: Technologies Under Construction. (Outsell market performancereport). Baeza-Yates, R.A., Ribeiro-Neto, B., 1999. Modern Information Retrieval. Addison-Wesley Longman PublishingCo.,Inc.,Boston,MA,USA. Cohen, W.W., 1995. Learning to classify English text with ILP methods. Advances in inductive logic programming32,124–143. DiakopoulosN.,2014.Algorithmicaccountability.Journalisticinvestigationofcomputatiionalpower structures.DigitalJournalism3,398–415. Filippov,S.,2014.MappingTextandDataMininginAcademicandResearchCommunitiesinEurope. FlaounasI.,AliO.,Landsdall-WelfareT.,DeBieT.,MosdellN.,LewisJ.,CristianiniN.,2013.Research methodsintheageofdigitaljournalism.Massive-scaleautomatedanalysisofnews-contenttopics,style,gender.DigitalJournalism1,102–116. Fleuren,W.W.M.,Alkema,W.,2015.Applicationoftextmininginthebiomedicaldomain.Methods 74,97–106.doi:10.1016/j.ymeth.2015.01.015 Hargreaves I., Guibault L., Handke C., Valcke P., Martens B., Lynch R., Filippov S., 2014. Standardisationintheareaofinnovationandtechnologicaldevelopment,notablyinthefield ofTextandDataMining.ReportfromtheExpertGroup. Hearst, M.A., 1999. Untangling Text Data Mining. College Park, Maryland, Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics(ACL’99)3–10. ICSU, ISSC, TWAS, IAP, 2015. Science International: Open Data in a Big Data World. International Council for Science, International Science Council, The World Academy of Sciences, InterAcademyPartnership. Ittoo, A., Nguyen, L.M., van den Bosch, A., 2016. Text analytics in industry: Challenges, desiderata andtrends.ComputersinIndustry.doi:10.1016/j.compind.2015.12.001 Joachims,T.,1998.TextCategorizationwithSuportVectorMachines:LearningwithManyRelevant Features,in:Proceedingsofthe10thEuropeanConferenceonMachineLearning.SpringerVerlag,pp.137–142. Lewis,D.D.,1998.Naive(Bayes)atForty:TheIndependenceAssumptioninInformationRetrieval,in: Proceedingsofthe10thEuropeanConferenceonMachineLearning.Springer-Verlag,pp.4– 15. Lewis S.C., Westlund O., 2014. Big data and journalism: Epistemology, expertise, economics, and ethics.DigitalJournalism3,447–466. Mayer-Schoenberger,V.,Cukier,K.,2013.BigData:ARevolutionThatWillTransformHowWeLive, WorkandThink.JohnMurrayPublishers. MotionforaresolutionfurthertoQuestionforOralAnswerB8-0116/2016pursuanttoRule128(5)of theRulesofProcedureon“Towardsathrivingdata-driveneconomy”(2015/2612(RSP),2016. OCDE, 1996. The knowledge-based economy. Organisation for Economic Co-operation and Development,Paris. RamirezE.,BrillJ.,OhlhaussenM.K.,McSweenyT.,2016.BigData.AToolforInclusionorExclusion? UnderstandingtheIssues.(FederalTradeCommission(FTC)Report). ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940 42 D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE Simon, H.A., Langley, P.W., Bradshaw, G.L., 1981. Scientific discovery as problem solving. Synthese 47,1–27.doi:10.1007/BF01064262 Simon,Uren,V.,Li,G.,Sereno,B.,Mancini,C.,2007.Modelingnaturalisticargumentationinresearch literatures:Representationandinteractiondesignissues:ResearchArticles.Int.J.Intell.Syst. 22,17–47.doi:10.1002/int.v22:1 Swanson, D.R., 1986. Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspect BiolMed30. vanHaagen,H.H.H.B.M.,’tHoen,P.,Bovo,A.B.,deMorrée,A.,vanMulligen,E.,Chichester,C.,Kors, J.,denDunnen,J.,vanOmmen,G.J.B.,vanderMaarel,S.,Kern,V.M.,Mons,B.,Schuemie, M., 2009. Novel protein-protein interactions inferred from literature context. PLoS ONE 4. doi:10.1371/journal.pone.0007894 ©2016FutureTDM|Horizon2020|GARRI-3-2014|665940